In the Berkeley RL class CS294-112 (Fa18, 9/5/18 lecture), they mention that the following gradient would be 0 if the policy is deterministic.
$$\nabla_{\theta} J(\theta) = E_{\tau \sim \pi_{\theta}(\tau)} \left[ \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_t \mid \mathbf{s}_t) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_t, \mathbf{a}_t) \right) \right]$$
Why is that?
Here is the gradient that they are discussing in the video:
$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t}) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right)$$
In this equation, $\pi_{\theta} (\mathbf{a}_{i, t} \vert \mathbf{s}_{i, t})$ denotes the probability of our policy $\pi_{\theta}$ selecting the actions $\mathbf{a}_{i, t}$ that it actually ended up selecting in practice, given the states $\mathbf{s}_{i, t}$ that it encountered during the episode that we're looking at.
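To make the estimator concrete, here is a minimal NumPy sketch of the sample-based estimate above. It assumes a hypothetical linear-softmax policy and uses random stand-ins for the visited states and rewards rather than a real environment (none of these specifics come from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

N, T = 4, 5              # hypothetical number of sampled trajectories and horizon
n_actions, dim = 3, 2    # hypothetical action count and state dimension

# Hypothetical linear-softmax policy: pi_theta(a | s) = softmax(theta @ s)[a]
theta = rng.normal(size=(n_actions, dim))

def log_probs(theta, s):
    """Log-probabilities of all actions in state s under the softmax policy."""
    logits = theta @ s
    logits = logits - logits.max()                 # numerical stability
    return logits - np.log(np.sum(np.exp(logits)))

def grad_log_prob(theta, s, a):
    """Gradient of log pi_theta(a | s) w.r.t. theta for the softmax policy."""
    probs = np.exp(log_probs(theta, s))
    grad = -np.outer(probs, s)   # every row a' gets -pi(a' | s) * s
    grad[a] += s                 # the chosen action's row gets an extra +s
    return grad

# Monte Carlo estimate: average of (sum_t grad log pi) * (sum_t reward) over N trajectories
grad_estimate = np.zeros_like(theta)
for i in range(N):
    states = rng.normal(size=(T, dim))   # stand-in for states s_{i,t} (no real environment here)
    actions = [rng.choice(n_actions, p=np.exp(log_probs(theta, s))) for s in states]
    rewards = rng.normal(size=T)         # stand-in for rewards r(s_{i,t}, a_{i,t})
    sum_grad_log = sum(grad_log_prob(theta, s, a) for s, a in zip(states, actions))
    grad_estimate += sum_grad_log * rewards.sum()
grad_estimate /= N
print(grad_estimate)
```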
In the case of a deterministic policy $\pi_{\theta}$, we know for sure that the probability of it selecting the actions that it did select must be $1$ (and the probability of it selecting any other actions would be $0$, but such a term does not show up in the equation). So, we have $\pi_{\theta} (\mathbf{a}_{i, t} \vert \mathbf{s}_{i, t}) = 1$ for every instance of that term in the above equation. Because $\log 1 = 0$, this leads to:
$$\begin{aligned}
\nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(\mathbf{a}_{i,t} \mid \mathbf{s}_{i,t}) \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right) \\
&= \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} \log 1 \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right) \\
&= \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} 0 \right) \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right) \\
&= \frac{1}{N} \sum_{i=1}^{N} 0 \cdot \left( \sum_{t=1}^{T} r(\mathbf{s}_{i,t}, \mathbf{a}_{i,t}) \right) \\
&= 0
\end{aligned}$$
(i.e. you end up with a sum of terms that are all multiplied by $0$).
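As a quick numerical sanity check (again just a sketch with a hypothetical linear-softmax policy, not anything from the lecture): if you sharpen such a policy toward determinism, $\pi_{\theta}(\mathbf{a}^* \vert \mathbf{s}) \to 1$ and $\nabla_{\theta} \log \pi_{\theta}(\mathbf{a}^* \vert \mathbf{s}) \to 0$, which mirrors the $\log 1 = 0$ argument above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 3, 2
theta = rng.normal(size=(n_actions, dim))  # hypothetical linear-softmax policy parameters
s = rng.normal(size=dim)                   # stand-in state

def grad_log_prob(theta, s, a):
    """Gradient of log pi_theta(a | s) w.r.t. theta for a linear-softmax policy."""
    logits = theta @ s
    z = logits - logits.max()              # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    grad = -np.outer(probs, s)
    grad[a] += s
    return grad

# Scaling theta sharpens the softmax toward the greedy (deterministic) policy.
a_greedy = int(np.argmax(theta @ s))
for scale in [1, 10, 100, 1000]:
    g = grad_log_prob(scale * theta, s, a_greedy)
    print(scale, np.linalg.norm(g))        # norm of grad log pi(a* | s) shrinks toward 0
```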
Well, I'd rather leave a comment, but I don't yet have that privilege, so here are a few remarks.
First, having a deterministic policy inside the log does indeed create trivial terms, since $\log 1 = 0$.
Secondly, in my view it makes no sense to use a deterministic policy during optimization in policy gradient methods, because you want to keep exploring. In my experience, you only make the policy deterministic (in a PG method) once you are done with the optimization and want to test your network.