
In the Berkeley RL class CS294-112 (Fall 2018, lecture of 9/5/18), they mention that the following gradient would be $0$ if the policy is deterministic.

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_t \vert \mathbf{s}_t) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right) \right]$$

Why is that?

asked by jonperl, edited by nbro

2 Answers


Here is the gradient that they are discussing in the video:

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_{i,t} \vert \mathbf{s}_{i,t}) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right)$$

In this equation, $\pi_{\theta} (\mathbf{a}_{i, t} \vert \mathbf{s}_{i, t})$ denotes the probability of our policy $\pi_{\theta}$ selecting the actions $\mathbf{a}_{i, t}$ that it actually ended up selecting in practice, given the states $\mathbf{s}_{i, t}$ that it encountered during the episode that we're looking at.
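
If it helps to see that estimator as code, here is a minimal NumPy sketch for a linear-softmax policy over discrete actions. The policy parameterization, the `(s, a, r)` trajectory format, and the toy numbers are my own assumptions for illustration, not anything from the lecture:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a | s) w.r.t. theta for a linear-softmax policy
    with logits = theta @ s (theta has shape (num_actions, state_dim))."""
    probs = softmax(theta @ s)
    one_hot = np.eye(len(probs))[a]
    # d log softmax / d logits = one_hot(a) - probs; chain rule through logits = theta @ s
    return np.outer(one_hot - probs, s)

def reinforce_gradient(theta, trajectories):
    """Monte Carlo estimate of grad_theta J(theta) as in the formula above.
    trajectories: list of N episodes, each a list of (s, a, r) tuples."""
    grad = np.zeros_like(theta)
    for episode in trajectories:
        sum_grad_log = sum(grad_log_pi(theta, s, a) for s, a, r in episode)
        total_return = sum(r for _, _, r in episode)
        grad += sum_grad_log * total_return
    return grad / len(trajectories)

# Made-up usage: 2 actions, 3-dimensional states, a single short episode.
theta = np.zeros((2, 3))
episode = [(np.array([1.0, 0.0, -1.0]), 0, 1.0),
           (np.array([0.5, 0.5, 0.0]), 1, -0.5)]
print(reinforce_gradient(theta, [episode]))
```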

In the case of a deterministic policy $\pi_{\theta}$, we know for sure that the probability of it selecting the actions that it did select must be $1$ (and the probability of it selecting any other actions would be $0$, but such a term does not show up in the equation). So, we have $\pi_{\theta} (\mathbf{a}_{i, t} \vert \mathbf{s}_{i, t}) = 1$ for every instance of that term in the above equation. Because $\log 1 = 0$, this leads to:

$$\begin{aligned} \nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_{i,t} \vert \mathbf{s}_{i,t}) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right) \\ &= \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} \log 1 \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right) \\ &= \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} 0 \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right) \\ &= \frac{1}{N} \sum_{i=1}^{N} 0 \cdot \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right) \\ &= 0 \end{aligned}$$

(i.e. you end up with a sum of terms that are all multiplied by $0$).
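
To see the same thing numerically, here is a small sketch (the state, the weights, and the linear-softmax parameterization are made-up assumptions): as the policy's probability for the action it actually takes saturates to $1$, $\nabla_{\theta} \log \pi_{\theta}(\mathbf{a} \vert \mathbf{s})$ shrinks to $0$, taking the whole estimator with it.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s = np.array([1.0, -0.5, 2.0])            # made-up state
theta = np.array([[0.2, 0.1, -0.3],       # made-up weights, logits = theta @ s
                  [0.5, -0.2, 0.4]])

# Scaling the weights makes the softmax sharper, i.e. the policy more deterministic.
for scale in [1.0, 10.0, 100.0]:
    scaled_theta = scale * theta
    probs = softmax(scaled_theta @ s)
    a = int(np.argmax(probs))             # the action the (near-)deterministic policy takes
    one_hot = np.eye(len(probs))[a]
    grad = np.outer(one_hot - probs, s)   # grad of log pi(a | s) w.r.t. the scaled weights
    print(scale, probs.round(6), np.abs(grad).max())
# As the policy approaches determinism, pi(a | s) -> 1, log pi -> 0, and the gradient -> 0.
```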

answered by Dennis Soemers, edited by nbro
  • Just thinking this through for the case when the policy is stochastic: if you take a step in the direction of $\nabla_{\theta} \log \pi_{\theta}$, this will increase the probability that the policy takes an action, since $\log$ is negative on $(0, 1]$ and increases towards $0$ as the probability grows. Since the gradient is multiplied by the reward, increasing the probability of an action increases the expected value of that state-action reward (when the reward is positive; it decreases the expected value when the reward is negative). Is this correct? – jonperl Sep 07 '18 at 01:41
  • Ok, yes that is basically what he says at https://youtu.be/XGmd3wcyDg8?t=1295. Thank you again so much -- I couldn't get past that part of the video until I understood the formula. – jonperl Sep 07 '18 at 02:00
  • @jonperl Yes that's correct. – Dennis Soemers Sep 07 '18 at 08:20

Well, I'd rather comment, but I don't yet have that privilege, so here are some comments.

First, having a deterministic policy inside the log only creates trivial terms, since every such probability is $1$ and $\log 1 = 0$.

Secondly, in my view it makes no sense to have a deterministic policy during the optimization in policy gradient methods, because you still want to explore. In my experience, you only make the policy deterministic (in a PG method) when you're done with the optimization and want to test your network.
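
As a small sketch of that practice (the logits below are made up), a common pattern is to sample actions from the stochastic policy while optimizing, and to switch to the greedy, deterministic choice only at evaluation time:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([0.3, 1.2, -0.7])   # made-up policy logits for a single state
probs = softmax(logits)

# During policy-gradient optimization: sample, so the policy stays stochastic,
# keeps exploring, and the log-probability terms are informative.
train_action = int(rng.choice(len(probs), p=probs))

# After training, when evaluating the network: act greedily (deterministic policy).
test_action = int(np.argmax(probs))
print(train_action, test_action)
```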

answered by 16Aghnar