TL;DR: why is one of the terms inside the expectation not differentiated the way I would expect?
Relative entropy policy search (REPS) is used to optimize a policy in an MDP. The update step is constrained in policy space (?) by the KL divergence to stabilize the update. From the KL constraint, plus some constraints coming from the definition of a policy, we can derive the Lagrangian and then the dual optimization problem. Lastly, we find the appropriate update step (delta) by solving the dual problem.
However, I think we can also use it to find multiple optimal solutions of an optimization problem, much like the CMA-ES (evolution strategy) algorithm does.
So, based on the original paper and the REPS section of this paper, I'm trying to derive the dual problem.
Suppose we're looking for a set of solutions represented as a parametrized distribution $\pi(x|\theta)$ that maximizes $H(x)$. Suppose the last parameters we came up with are denoted $\hat{\theta}$; we find the new optimal parameters $\theta$ by:
$\max_\theta \int_x H(x)\,\pi(x|\theta)\, dx$
s.t. $\int_x \pi(x|\theta)\, dx = 1$
$D_\text{KL}\left(\pi(\cdot|\theta)\,\|\,\pi(\cdot|\hat{\theta})\right) \leq \epsilon$
with $D_\text{KL}\left(\pi(\cdot|\theta)\,\|\,\pi(\cdot|\hat{\theta})\right) = \int_x \pi(x|\theta)\log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})}\, dx$
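For concreteness, here is a minimal numerical sketch of this setup (the Gaussian search distribution, the toy objective $H(x) = -x^2$, the KL budget, and all names below are my own choices, not from the papers):

```python
import numpy as np

# Toy instance of the constrained problem above (my own choices, not from the papers):
# search distribution pi(x|theta) = N(mu, sigma^2), objective H(x) = -x^2, KL budget eps.
rng = np.random.default_rng(0)

def H(x):
    return -x**2

def expected_H(mu, sigma, n=100_000):
    # Monte Carlo estimate of the objective  E_{pi(x|theta)}[H(x)]
    x = rng.normal(mu, sigma, size=n)
    return H(x).mean()

def kl_gauss(mu, sigma, mu_old, sigma_old):
    # Closed-form KL( N(mu, sigma^2) || N(mu_old, sigma_old^2) )
    return (np.log(sigma_old / sigma)
            + (sigma**2 + (mu - mu_old)**2) / (2 * sigma_old**2) - 0.5)

eps = 0.1                        # KL trust-region size epsilon
mu_old, sigma_old = 2.0, 1.0     # last parameters theta_hat
mu_new, sigma_new = 1.8, 0.95    # candidate new parameters theta

kl = kl_gauss(mu_new, sigma_new, mu_old, sigma_old)
print("objective :", expected_H(mu_new, sigma_new))
print("KL        :", kl, "<= eps?", kl <= eps)
```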
Based on the equations above, we can write the Lagrangian as follows:
$L(\theta, \lambda, \eta) = \int_x H(x)\,\pi(x|\theta)\, dx + \lambda\left(1-\int_x \pi(x|\theta)\, dx\right) + \eta\left(\epsilon-\int_x \pi(x|\theta)\log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})}\, dx\right)$
Now, we can see that the term $\lambda\left(1-\int_x \pi(x|\theta)\, dx\right)$ is $0$, right? But in the papers it is not cancelled out. So, following the flow of the two papers, we can simplify the Lagrangian by treating the integral with respect to $x$ as an expectation:
$L(\theta, \lambda, \eta) = \lambda + \eta\epsilon + \underset{\pi(x|\theta)}{\mathbb{E}}\left[H(x) -\lambda -\eta \log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})} \right]$
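To convince myself that this regrouping is just algebra, here is a quick check on a discrete toy distribution (my own construction; with finitely many values of $x$, the integrals become sums), where both forms of $L$ evaluate to the same number:

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete toy check: x takes K values, so integrals over x become sums.
K = 5
H = rng.normal(size=K)                  # arbitrary values of H(x)
p_old = rng.dirichlet(np.ones(K))       # pi(x | theta_hat)
p = rng.dirichlet(np.ones(K))           # pi(x | theta)
lam, eta, eps = 0.7, 2.0, 0.1           # arbitrary multipliers and KL budget

# Original Lagrangian: objective + lambda*(1 - sum p) + eta*(eps - KL(p || p_old))
L1 = (H * p).sum() + lam * (1 - p.sum()) + eta * (eps - (p * np.log(p / p_old)).sum())

# Regrouped form: lambda + eta*eps + E_p[ H - lambda - eta*log(p/p_old) ]
L2 = lam + eta * eps + (p * (H - lam - eta * np.log(p / p_old))).sum()

print(L1, L2, np.allclose(L1, L2))
```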
We find the optimal $\pi(x|\theta)$ by solving $\frac{\partial L}{\partial \pi(x|\theta)} = 0$. This is the step where I get confused. If I mindlessly copy/follow the notation from here, the derivative of $L$ with respect to the policy parametrized by $\theta$ is:
$\frac{\partial L}{\partial \pi(x|\theta)} = H(x) - \lambda - \eta \log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})}$
Where does the integral with respect to $x$ go? Is it because every term is multiplied by $\pi(x|\theta)$, so the integral/expectation can be dropped? If so, why does the KL term inside the expectation differentiate to just $\eta \log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})}$? Shouldn't the $\pi(x|\theta)$ inside the log produce something more?
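To make the confusion concrete, this is the term-by-term derivative I would expect, treating $\pi(x|\theta)$ at each $x$ as the variable and applying the product rule to the KL term (my own calculation, not taken from the papers):
$\frac{\partial}{\partial \pi(x|\theta)}\left[\pi(x|\theta)\log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})}\right] = \log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})} + 1$
so that
$\frac{\partial L}{\partial \pi(x|\theta)} = H(x) - \lambda - \eta\log\frac{\pi(x|\theta)}{\pi(x|\hat{\theta})} - \eta$
i.e. with an extra $-\eta$ that does not appear in the papers' expression, which is exactly where my confusion comes from.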