
I would like to understand the difference between the standard policy gradient theorem and the deterministic policy gradient theorem. The two theorems look quite different, even though the only change is whether the policy function is stochastic or deterministic. I have summarized the relevant steps of both derivations below. The policy function is $\pi$, which has parameters $\theta$.

Standard Policy Gradient $$ \begin{aligned} \dfrac{\partial V}{\partial \theta} &= \dfrac{\partial}{\partial \theta} \left[ \sum_a \pi(a|s) Q(s,a) \right] \\ &= \sum_a \left[ \dfrac{\partial \pi(a|s)}{\partial \theta} Q(s,a) + \pi(a|s) \dfrac{\partial Q(s,a)}{\partial \theta} \right] \\ &= \sum_a \left[ \dfrac{\partial \pi(a|s)}{\partial \theta} Q(s,a) + \pi(a|s) \dfrac{\partial}{\partial \theta} \left[ R + \gamma \sum_{s'} p(s'|s,a) V(s') \right] \right] \\ &= \sum_a \left[ \dfrac{\partial \pi(a|s)}{\partial \theta} Q(s,a) + \pi(a|s) \gamma \sum_{s'} p(s'|s,a) \dfrac{\partial V(s') }{\partial \theta} \right] \end{aligned} $$ Expanding next period's value function $V(s')$ in the same way repeatedly, one eventually reaches the final policy gradient: $$ \dfrac{\partial J}{\partial \theta} = \sum_s \rho(s) \sum_a \dfrac{\partial \pi(a|s)}{\partial \theta} Q(s,a) $$ with $\rho$ being the stationary state distribution. What I find particularly interesting is that there is no derivative of the reward $R$ with respect to $\theta$, nor of the transition probability $p(s'|s,a)$ with respect to $\theta$.
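For concreteness, here is a small autograd sketch of how I read the stochastic result: $Q(s,a)$ is treated as a constant with respect to $\theta$ and only $\pi(a|s)$ is differentiated. The tabular softmax policy and the hard-coded $Q$ values are placeholders I made up, not part of the theorem.

```python
import torch

# Minimal sketch of the stochastic policy gradient for a single state.
# The tabular softmax policy and the hard-coded Q values are placeholders.
n_states, n_actions = 4, 3
theta = torch.zeros(n_states, n_actions, requires_grad=True)  # policy logits

def pi(s):
    # Stochastic policy pi(.|s) as a softmax over the logits of state s.
    return torch.softmax(theta[s], dim=-1)

s = 2
q_values = torch.tensor([0.1, 1.0, -0.5])  # stand-in for Q(s,.), held constant w.r.t. theta

# For this state: dJ/dtheta = sum_a dpi(a|s)/dtheta * Q(s,a).
# Only pi is differentiated; R and p(s'|s,a) never appear explicitly.
surrogate = (pi(s) * q_values).sum()
surrogate.backward()
print(theta.grad[s])  # gradient only touches the logits of state s
```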

The derivation of the deterministic policy gradient theorem is different:

Deterministic Policy Gradient Theorem $$ \begin{aligned} \dfrac{\partial V}{\partial \theta} &= \dfrac{\partial}{\partial \theta} Q(s,\pi(s)) \\ &= \dfrac{\partial}{\partial \theta} \left[ R(s, \pi(s)) + \gamma \sum_{s'} p(s'|s,\pi(s)) V(s') \right] \\ &= \dfrac{\partial R(s, a)}{\partial a}\dfrac{\partial \pi(s)}{\partial \theta} + \dfrac{\partial}{\partial \theta} \left[\gamma \sum_{s'} p(s'|s,\pi(s)) V(s') \right] \\ &= \dfrac{\partial R(s, a)}{\partial a}\dfrac{\partial \pi(s)}{\partial \theta} + \gamma \sum_{s'} \left[p(s'|s,\pi(s)) \dfrac{\partial V(s')}{\partial \theta} + \dfrac{\partial \pi(s)}{\partial \theta} \dfrac{\partial p(s'|s,a)}{\partial a} V(s') \right] \\ &= \dfrac{\partial \pi(s)}{\partial \theta} \dfrac{\partial}{\partial a} \left[ R(s, a) + \gamma \sum_{s'} p(s'|s,a) V(s') \right] + \gamma \sum_{s'} p(s'|s,\pi(s)) \dfrac{\partial V(s')}{\partial \theta} \\ &= \dfrac{\partial \pi(s)}{\partial \theta} \dfrac{\partial Q(s, a)}{\partial a} + \gamma \sum_{s'} p(s'|s,\pi(s)) \dfrac{\partial V(s')}{\partial \theta} \end{aligned} $$ where the derivatives with respect to $a$ are evaluated at $a = \pi(s)$. Again, one can obtain the final policy gradient by expanding next period's value function. The policy gradient is: $$ \dfrac{\partial J}{\partial \theta} = \sum_s \rho(s) \dfrac{\partial \pi(s)}{\partial \theta} \dfrac{\partial Q(s,a)}{\partial a} $$ In contrast to the standard policy gradient, these equations contain derivatives of the reward function $R$ and of the transition probability $p(s'|s,a)$ with respect to $a$.
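And the analogous sketch for the deterministic case, where the gradient reaches $\theta$ only through $a = \pi(s)$ via the chain rule. Again, the linear policy and the toy critic are placeholders of my own.

```python
import torch

# Minimal sketch of the deterministic policy gradient for a single state.
# The linear policy and the quadratic critic are made-up placeholders;
# only the chain-rule structure dpi/dtheta * dQ/da matters here.
state_dim, action_dim = 3, 2
theta = torch.zeros(action_dim, state_dim, requires_grad=True)  # linear deterministic policy

def pi(s):
    return theta @ s  # a = pi(s), differentiable in theta

def q(s, a):
    return -((a - s[:action_dim]) ** 2).sum()  # toy differentiable critic Q(s, a)

s = torch.tensor([0.5, -1.0, 2.0])

# For this state: dJ/dtheta = dpi(s)/dtheta * dQ(s,a)/da at a = pi(s).
# Backpropagating through a = pi(s) applies exactly this chain rule.
q(s, pi(s)).backward()
print(theta.grad)
```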

Question

Why do the two theorems differ in their treatment of the derivatives of $R$ and the conditional probability? Does determinism in the policy function make such a difference for the derivatives?

fabian
  • I don't have time to leave a full answer for now, but in short it is because the latter requires the chain rule: the parameters we differentiate with respect to are parameters of $\pi$, but we only have access to $Q$ -- it is like wanting to do the following: $\frac{d}{dx} f(g(x))$. I can try to leave a better answer if nobody else has when I get a chance. – David Aug 04 '20 at 18:17
  • Thank you @DavidIreland for your comment! I'd be very grateful to read your full answer! Can you elaborate a bit on why the chain rule should not be applicable in the usual policy gradient theorem? Implicitly, $R(s,a)$ is also a function of $\pi(a|s)$, albeit a stochastic one. – fabian Aug 05 '20 at 05:53

1 Answer


In the policy gradient theorem, we don't need to write $r$ as a function of $a$ because the only time we explicitly 'see' $r$ is when we take the expectation with respect to the policy. For the first couple of lines of the PG theorem we have \begin{align} \nabla v_\pi(s) &= \nabla \left[ \sum_a \pi(a|s) q_\pi (s,a) \right] \;, \\ &= \sum_a \left[ \nabla \pi(a|s) q_\pi(s,a) + \pi(a|s) \nabla\sum_{s',r} p(s',r|s,a)(r+ v_\pi(s')) \right] \; ; \end{align} you can see that we are taking the expectation of $r$ with respect to the policy, so we don't need to write something like $r(s,\pi(a|s))$ (especially because this notation doesn't really make sense for a stochastic policy). This is why we don't need to take the derivative of $r$ with respect to the policy parameters. Now, the next line of the PG theorem is $$\nabla v_\pi(s) = \sum_a \left[ \nabla \pi(a|s) q_\pi(s,a) + \pi(a|s)\sum_{s'} p(s'|s,a) \nabla v_\pi(s') \right] \; ;$$ so now we have an equation similar to the Bellman equation in terms of the $\nabla v_\pi(s)$'s, which we can unroll repeatedly, meaning we never have to take an explicit derivative of the value function.

For the deterministic gradient, this is a bit different. In general we have $$v_\pi(s) = \mathbb{E}_\pi[Q(s,a)] = \sum_a \pi(a|s) Q(s,a)\;,$$ so for a deterministic policy (denoted by $\pi(s)$, which represents the action taken in state $s$) this becomes $$v_\pi(s) = Q(s,\pi(s))$$ because the deterministic policy has probability 0 for all actions except one, which has probability 1.
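Written out for a finite action space, this is just the observation that the sum over actions collapses onto the single action selected by the policy:
$$\pi(a|s) = \mathbf{1}\{a = \pi(s)\} \quad \Longrightarrow \quad v_\pi(s) = \sum_a \pi(a|s)\, Q(s,a) = Q(s,\pi(s))\;.$$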

Now, in the deterministic policy gradient theorem we can write $$\nabla v_\pi(s) = \nabla Q(s,\pi(s)) = \nabla \left(r(s, \pi(s)) + \sum_{s'} p(s'|s,\pi(s))\,v_\pi(s') \right)\;.$$

We have to write $r$ explicitly as a function of $s$ and $a$ now, because we are not taking an expectation with respect to the actions: the policy is deterministic. If you replace where I have written $\nabla$ with the notation you used for the derivatives, you will arrive at the same result, and you'll see why you need the chain rule. I believe you understand that part already; your question was really why we don't use the chain rule for the normal policy gradient, which I have hopefully explained -- it comes down to how an expectation over the action space works with a stochastic policy vs. a deterministic one.

Another way to think of this is as follows -- the term you're concerned with is obtained by expanding $\nabla q_\pi(s,a) = \nabla \sum_{s', r}p(s',r|s,a)(r + v_\pi(s'))$. Because, by the definition of the $Q$ function, we have conditioned on knowing $s$ and $a$, the action $a$ is completely independent of the policy in this scenario -- we could even condition on an action to which the policy assigns probability 0 -- thus the derivative of $r$ with respect to the policy parameters is 0.

However, in the deterministic policy gradient we are taking $\nabla q_\pi(s, \pi(s)) = \nabla \left(r(s, \pi(s)) + \sum_{s'} p(s'|s,\pi(s))\, v_\pi(s')\right)$ -- here $r$ clearly depends on the policy parameters, because the action taken is the deterministic action given by the policy in state $s$; thus the derivative with respect to the policy parameters is not necessarily 0!
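If it helps, here is a toy autograd check of exactly this contrast (not part of the original argument; the linear policy and quadratic reward are made-up placeholders): with an arbitrary fixed action there is simply no computational path from the policy parameters to $r$, whereas with $a = \pi(s)$ there is.

```python
import torch

# Toy check: gradient of r w.r.t. theta for a fixed action vs. a = pi(s).
# The linear policy and quadratic reward below are illustrative placeholders.
theta = torch.tensor([1.0, -2.0], requires_grad=True)
s = torch.tensor([0.5, 1.5])

def pi(s):
    return (theta * s).sum()          # deterministic policy a = pi(s)

def r(s, a):
    return -(a - 3.0) ** 2 + s.sum()  # toy reward r(s, a)

# Stochastic-PG case: inside nabla q(s, a) the action is a conditioned-on constant,
# so r(s, a) contains no path back to theta and its gradient is identically zero.
a_fixed = torch.tensor(0.7)
print(r(s, a_fixed).requires_grad)    # False: no computational path to theta

# Deterministic-PG case: the action is a = pi(s), so r depends on theta through a.
grad = torch.autograd.grad(r(s, pi(s)), theta)[0]
print(grad)                           # nonzero: chain rule dr/da * dpi/dtheta
```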

David
  • Thank you very much David! Your answer is extremely helpful to me, and I have marked it as accepted. Nevertheless, I still don't fully comprehend it -- most likely because I lack knowledge about how conditional expectations interact with derivatives. Do I understand your argument correctly as saying that one cannot compute $\partial E[y(x) | x] / \partial x$? If, by any chance, you know a good book on how conditional expectations affect derivatives, I'd be happy to read it! Anyway, thanks again for the detailed answer! – fabian Aug 05 '20 at 16:41
  • It is not so much that we can't compute that derivative; in fact, in general this derivative is possible to calculate. It is more that in the stochastic policy gradient the policy affects the reward through the expectation, if that makes sense. In the deterministic policy case, you do technically still take the same expectation; it is just that, because of how expectations over a deterministic policy work out, the policy directly affects the reward, which is what I tried to stress in my answer. – David Aug 05 '20 at 18:06
  • @fabian also a good way to think of it is that in the standard policy gradient $r$ is a function of $s,a$ (denote this by $r(s,a)$), but we condition on knowing $a$, so this is independent of the policy and thus constant with respect to the policy parameters. In the deterministic PG we still condition on taking an action $a$, but here the action is $a=\pi(s)$, which shows there is a clear dependence on the policy parameters. Please let me know if this helps clarify things and I can edit it into my answer, as I feel my original explanation may have been a bit convoluted. – David Aug 05 '20 at 18:13
  • Thanks for your comments! I benefited particularly from your first comment. Regarding the second comment: we could also think of the stochastic policy just as the deterministic policy with a little bit of noise (like the behavioral exploration policy in DDPG), so there would also be a *clear dependence on the policy parameters*, as you said. Overall, it's still a bit difficult for me to understand the precise mathematical reason / proof for the differences between the two theorems, but your answers and comments illuminated many aspects greatly for me; thank you again for that. – fabian Aug 06 '20 at 10:54
  • @fabian no, not really. The deterministic policy is a limiting case of a stochastic policy, as shown in the deterministic policy gradient paper. If you read my edit on the paper, it should clarify things further. Because we have conditioned on some $a$ in the stochastic case, there is no dependency on the policy -- as I mentioned, it is just an arbitrary action. When this is done in the deterministic case there absolutely is a dependency, because we are conditioning _on the action the deterministic policy gives_. – David Aug 06 '20 at 13:21
  • Your last comment (and also the last two paragraphs of your answer) brought the point home. This really makes sense. Thank you very much indeed! – fabian Aug 06 '20 at 15:08
  • Just to summarize and check whether I understood it: In the stochastic case (and under conditional expectation in general) $E[r|s,a]=f(s,a)$ for some function $f$. Because $a$ is stochastic $a \sim \pi(s; \theta)$ and because we have already conditioned on $a$ we *cannot go back* and take the derivative w.r.t. $\theta$. In contrast for the deterministic case $E[r|s]=f(s) = f(\pi(s; \theta))$ we can take the derivative w.r.t $\theta$ as the entire chain is deterministic. Could I summarize your argument like this, @David? – fabian Aug 06 '20 at 16:04
  • Yes, exactly. Good job! – David Aug 06 '20 at 16:20
  • @David thank you for your answer, but could you elaborate on some notation I am still unfamiliar with? I can't find the corresponding parts in the proof, or justification for some of your statements. E.g.: (a) $(r + v_{\pi}(s'))$, which looks more like the Bellman equation (2nd equation line), never appears in the proof and I can't obtain it by basic rewriting of the equations (even when accounting for $\mathcal{P}_{ss'}^a$). (b) The deterministic state value function $v_{\pi}(s)$ is the expected value (under policy $\pi$?) of the action value functions, but here it is just a sum of values? (c) Why condition on the action and not on the state $s$? – gr4nt3d Aug 19 '23 at 17:23