TL;DR: why is one of the terms inside the expectation not differentiated the way I would expect?
Relative entropy policy search (REPS) is used to optimize a policy in an MDP. The update step is constrained in policy space (?) by the KL divergence to stabilize the update. From the KL constraint, plus some constraints coming from the definition of a policy, we can derive the Lagrangian and then the dual optimization problem. Lastly, we find the appropriate update step (delta) by solving the dual problem.
However, I think we can also use it to find multiple optimal solutions of an optimization problem, much like the CMA-ES (evolution strategy) algorithm does.
So, based on the original paper and the REPS section of this paper, I'm trying to derive the dual problem.
Suppose we're looking for a set of solutions represented as a parametrized distribution $\pi(x|\theta)$ that maximizes $H(x)$. Suppose the last parameters we came up with are denoted $\hat{\theta}$; we find the new optimal parameters $\theta$ by:
$\max_\theta \int_x H(x)\,\pi(x|\theta)\, dx$
s.t. $\int_x \pi(x|\theta)\, dx = 1$
$D_\text{KL}\left(\pi(\cdot|\theta)\,\|\,\pi(\cdot|\hat{\theta})\right) \leq \epsilon$
with $D_\text{KL}\left(\pi(\cdot|\theta)\,\|\,\pi(\cdot|\hat{\theta})\right) = \int_x \pi(x|\theta)\log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})}\, dx$
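For concreteness, here is a minimal numerical sketch of this setup (the Gaussian search distribution, the toy objective $H(x) = -x^2$, the KL budget, and all names below are my own choices, not from the papers):

```python
import numpy as np

# Toy instance of the constrained problem above (my own choices, not from the papers):
# search distribution pi(x|theta) = N(mu, sigma^2), objective H(x) = -x^2, KL budget eps.
rng = np.random.default_rng(0)

def H(x):
    return -x**2

def expected_H(mu, sigma, n=100_000):
    # Monte Carlo estimate of the objective  E_{pi(x|theta)}[H(x)]
    x = rng.normal(mu, sigma, size=n)
    return H(x).mean()

def kl_gauss(mu, sigma, mu_old, sigma_old):
    # Closed-form KL( N(mu, sigma^2) || N(mu_old, sigma_old^2) )
    return (np.log(sigma_old / sigma)
            + (sigma**2 + (mu - mu_old)**2) / (2 * sigma_old**2) - 0.5)

eps = 0.1                        # KL trust-region size epsilon
mu_old, sigma_old = 2.0, 1.0     # last parameters theta_hat
mu_new, sigma_new = 1.8, 0.95    # candidate new parameters theta

kl = kl_gauss(mu_new, sigma_new, mu_old, sigma_old)
print("objective :", expected_H(mu_new, sigma_new))
print("KL        :", kl, "<= eps?", kl <= eps)
```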
Based on the equations above, we can write the Lagrangian as follows:
$L(\theta, \lambda, \eta) = \int_x H(x)\,\pi(x|\theta)\, dx + \lambda\left(1-\int_x \pi(x|\theta)\, dx\right) + \eta\left(\epsilon-\int_x \pi(x|\theta)\log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})}\, dx\right)$
Now, we can see that the term $\lambda\left(1-\int_x \pi(x|\theta)\, dx\right)$ is $0$, right? But in the papers it is not cancelled out. So, following the flow of the two papers, we can simplify the Lagrangian by treating the integral with respect to $x$ as an expectation:
$L(\theta, \lambda, \eta) = \lambda + \eta\epsilon + \underset{\pi(x|\theta)}{\mathbb{E}}\left[H(x) -\lambda -\eta \log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})} \right]$
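To convince myself that this regrouping is just algebra, here is a quick check on a discrete toy distribution (my own construction; with finitely many values of $x$, the integrals become sums), where both forms of $L$ evaluate to the same number:

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete toy check: x takes K values, so integrals over x become sums.
K = 5
H = rng.normal(size=K)                  # arbitrary values of H(x)
p_old = rng.dirichlet(np.ones(K))       # pi(x | theta_hat)
p = rng.dirichlet(np.ones(K))           # pi(x | theta)
lam, eta, eps = 0.7, 2.0, 0.1           # arbitrary multipliers and KL budget

# Original Lagrangian: objective + lambda*(1 - sum p) + eta*(eps - KL(p || p_old))
L1 = (H * p).sum() + lam * (1 - p.sum()) + eta * (eps - (p * np.log(p / p_old)).sum())

# Regrouped form: lambda + eta*eps + E_p[ H - lambda - eta*log(p/p_old) ]
L2 = lam + eta * eps + (p * (H - lam - eta * np.log(p / p_old))).sum()

print(L1, L2, np.allclose(L1, L2))
```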
We find the optimal $\pi(x|\theta)$ by solving $\frac{\partial L}{\partial \pi(x|\theta)} = 0$. This is the step where I get confused. If I mindlessly copy/follow the notation from here, the derivative of $L$ with respect to the policy parametrized by $\theta$ is:
$\frac{\partial L}{\partial \pi(x|\theta)} = H(x) - \lambda - \eta \log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})}$
Where does the integral with respect to $x$ go? Is it because every term is multiplied by $\pi(x|\theta)$, so the integral/expectation can be dropped? If so, why does the KL term inside the expectation differentiate to just $\eta \log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})}$? Shouldn't the $\pi(x|\theta)$ inside the log produce something more?
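To make the confusion concrete, this is the term-by-term derivative I would expect, treating $\pi(x|\theta)$ at each $x$ as the variable and applying the product rule to the KL term (my own calculation, not taken from the papers):
$\frac{\partial}{\partial \pi(x|\theta)}\left[\pi(x|\theta)\log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})}\right] = \log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})} + 1$
so that
$\frac{\partial L}{\partial \pi(x|\theta)} = H(x) - \lambda - \eta\log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})} - \eta$
i.e. with an extra $-\eta$ that does not appear in the papers' expression, which is exactly where my confusion comes from.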