
In the Berkeley RL class CS294-112 (Fall 2018, lecture of 9/5/18), they mention that the following gradient would be $0$ if the policy is deterministic.

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_t \vert \mathbf{s}_t) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right) \right]$$

Why is that?

asked by jonperl, edited by nbro

2 Answers


Here is the gradient that they are discussing in the video:

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_{i,t} \vert \mathbf{s}_{i,t}) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right)$$

In this equation, $\pi_{\theta} (\mathbf{a}_{i, t} \vert \mathbf{s}_{i, t})$ denotes the probability of our policy $\pi_{\theta}$ selecting the actions $\mathbf{a}_{i, t}$ that it actually ended up selecting in practice, given the states $\mathbf{s}_{i, t}$ that it encountered during the episode that we're looking at.
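
If it helps to see that estimator as code, here is a minimal NumPy sketch for a linear-softmax policy over discrete actions. The policy parameterization, the `(s, a, r)` trajectory format, and the toy numbers are my own assumptions for illustration, not anything from the lecture:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a | s) w.r.t. theta for a linear-softmax policy
    with logits = theta @ s (theta has shape (num_actions, state_dim))."""
    probs = softmax(theta @ s)
    one_hot = np.eye(len(probs))[a]
    # d log softmax / d logits = one_hot(a) - probs; chain rule through logits = theta @ s
    return np.outer(one_hot - probs, s)

def reinforce_gradient(theta, trajectories):
    """Monte Carlo estimate of grad_theta J(theta) as in the formula above.
    trajectories: list of N episodes, each a list of (s, a, r) tuples."""
    grad = np.zeros_like(theta)
    for episode in trajectories:
        sum_grad_log = sum(grad_log_pi(theta, s, a) for s, a, r in episode)
        total_return = sum(r for _, _, r in episode)
        grad += sum_grad_log * total_return
    return grad / len(trajectories)

# Made-up usage: 2 actions, 3-dimensional states, a single short episode.
theta = np.zeros((2, 3))
episode = [(np.array([1.0, 0.0, -1.0]), 0, 1.0),
           (np.array([0.5, 0.5, 0.0]), 1, -0.5)]
print(reinforce_gradient(theta, [episode]))
```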

In the case of a deterministic policy $\pi_{\theta}$, we know for sure that the probability of it selecting the actions that it did select must be $1$ (and the probability of it selecting any other actions would be $0$, but such a term does not show up in the equation). So, we have $\pi_{\theta} (\mathbf{a}_{i, t} \vert \mathbf{s}_{i, t}) = 1$ for every instance of that term in the above equation. Because $\log 1 = 0$, this leads to:

$$\begin{aligned} \nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_{i,t} \vert \mathbf{s}_{i,t}) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right) \\ &= \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} \log 1 \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right) \\ &= \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} 0 \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right) \\ &= \frac{1}{N} \sum_{i=1}^{N} 0 \cdot \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right) \\ &= 0 \end{aligned}$$

(i.e. you end up with a sum of terms that are all multiplied by $0$).
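
To see the same thing numerically, here is a small sketch (the state, the weights, and the linear-softmax parameterization are made-up assumptions): as the policy's probability for the action it actually takes saturates to $1$, $\nabla_{\theta} \log \pi_{\theta}(\mathbf{a} \vert \mathbf{s})$ shrinks to $0$, taking the whole estimator with it.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s = np.array([1.0, -0.5, 2.0])            # made-up state
theta = np.array([[0.2, 0.1, -0.3],       # made-up weights, logits = theta @ s
                  [0.5, -0.2, 0.4]])

# Scaling the weights makes the softmax sharper, i.e. the policy more deterministic.
for scale in [1.0, 10.0, 100.0]:
    scaled_theta = scale * theta
    probs = softmax(scaled_theta @ s)
    a = int(np.argmax(probs))             # the action the (near-)deterministic policy takes
    one_hot = np.eye(len(probs))[a]
    grad = np.outer(one_hot - probs, s)   # grad of log pi(a | s) w.r.t. the scaled weights
    print(scale, probs.round(6), np.abs(grad).max())
# As the policy approaches determinism, pi(a | s) -> 1, log pi -> 0, and the gradient -> 0.
```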

answered by Dennis Soemers, edited by nbro
  • Just thinking this through for the case when the policy is stochastic: if you take a step in the direction of $\nabla_{\theta} \log \pi_{\theta}$, this will increase the probability that the policy takes an action, since $\log$ is negative on $(0, 1]$ and increases towards $0$ as the probability grows. Since the gradient is multiplied by the reward, increasing the probability of an action increases the expected value of that state-action reward (when the reward is positive; it decreases the expected value when the reward is negative). Is this correct? – jonperl Sep 07 '18 at 01:41
  • Ok, yes that is basically what he says at https://youtu.be/XGmd3wcyDg8?t=1295. Thank you again so much -- I couldn't get past that part of the video until I understood the formula. – jonperl Sep 07 '18 at 02:00
  • @jonperl Yes that's correct. – Dennis Soemers Sep 07 '18 at 08:20

Well, I'd rather comment, but I don't yet have that privilege, so here are some comments.

First, having a deterministic policy inside the log only creates trivial terms, since every such probability is $1$ and $\log 1 = 0$.

Secondly, in my view it makes no sense to have a deterministic policy during the optimization in policy gradient methods, because you still want to explore. In my experience, you only make the policy deterministic (in a PG method) when you're done with the optimization and want to test your network.
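
As a small sketch of that practice (the logits below are made up), a common pattern is to sample actions from the stochastic policy while optimizing, and to switch to the greedy, deterministic choice only at evaluation time:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([0.3, 1.2, -0.7])   # made-up policy logits for a single state
probs = softmax(logits)

# During policy-gradient optimization: sample, so the policy stays stochastic,
# keeps exploring, and the log-probability terms are informative.
train_action = int(rng.choice(len(probs), p=probs))

# After training, when evaluating the network: act greedily (deterministic policy).
test_action = int(np.argmax(probs))
print(train_action, test_action)
```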

answered by 16Aghnar