
If we want to learn a stochastic policy with the policy gradient method, we have to sample from the distribution to get an action.

Wouldn't this lead to the same issue that variational autoencoders face without the reparameterization trick, where the gradient cannot pass through sampling? Or is the reparameterization trick also used in the stochastic policy gradient method to address this issue?

Sam

1 Answer


For policy-gradient methods that require a differentiable sample from the action distribution, such as Soft Actor-Critic, you are correct: they suffer from the same problem of needing the gradient to pass through the sampling step.

The common way around this is to use distributions that can be sampled via the re-parameterisation trick. The classic example is a Gaussian distribution: suppose our network outputs the mean and standard deviation $\mu_\theta$ and $\sigma_\theta$, respectively (where $\theta$ are the network parameters). We can then write $X = \mu_\theta + \sigma_\theta \times \epsilon$, where $\epsilon$ follows a unit-normal distribution. To sample from $X$, all we need to do is sample $\epsilon$ from the unit-normal distribution; since $\epsilon$ does not depend on the network parameters, the gradient can flow through $\mu_\theta$ and $\sigma_\theta$ back into $\theta$.
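
Here is a minimal sketch of the re-parameterised Gaussian sample in PyTorch; the class, layer sizes, and variable names are illustrative assumptions, not from the question or a specific implementation:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Toy Gaussian policy: outputs mu_theta and sigma_theta for a state."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.Tanh())
        self.mu_head = nn.Linear(hidden_dim, action_dim)       # mu_theta
        self.log_std_head = nn.Linear(hidden_dim, action_dim)  # log sigma_theta

    def sample(self, state):
        h = self.body(state)
        mu = self.mu_head(h)
        std = self.log_std_head(h).exp()
        # Re-parameterisation: the randomness lives in epsilon ~ N(0, I),
        # which does not depend on theta, so gradients flow through mu and std.
        eps = torch.randn_like(std)
        return mu + std * eps

policy = GaussianPolicy(state_dim=4, action_dim=2)
state = torch.randn(1, 4)
action = policy.sample(state)   # differentiable w.r.t. the network parameters
action.sum().backward()         # gradients reach policy.parameters()
```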

The same can be achieved for discrete actions using the Gumbel-Softmax trick. For more information on this, I would recommend this blog post (and the references within).
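
A minimal sketch of Gumbel-Softmax sampling, assuming PyTorch; the logits here are placeholders standing in for the output of a policy network:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 4, requires_grad=True)  # unnormalised log-probs from the policy

def gumbel_softmax_sample(logits, tau=1.0):
    # Add Gumbel(0, 1) noise to the logits, then apply a tempered softmax.
    u = torch.rand_like(logits)
    gumbel_noise = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel_noise) / tau, dim=-1)

soft_action = gumbel_softmax_sample(logits)      # differentiable "soft" one-hot vector

# PyTorch also provides this directly; hard=True returns a one-hot sample whose
# gradient is taken through the soft relaxation (straight-through estimator).
hard_action = F.gumbel_softmax(logits, tau=1.0, hard=True)
```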

For algorithms derived from the REINFORCE algorithm, where the policy gradient is given by $G_t \nabla_\theta \log \pi_\theta(a|s)$, you can see that this is equivalent to maximising the log-likelihood of the chosen action, scaled by the return. All we need to do here is obtain the parameters of the policy's distribution (e.g. the mean and standard deviation of a Gaussian) with a forward pass of the network for the current state, and then evaluate the PDF/PMF at the action we selected -- this does not require the sampled action to be differentiable.
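
A minimal sketch of this REINFORCE-style update in PyTorch; the network, state, and return value are illustrative assumptions:

```python
import torch
from torch.distributions import Categorical

policy_net = torch.nn.Linear(4, 3)  # maps state features to action logits
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

state = torch.randn(1, 4)
G_t = 5.0                           # return observed from this time step

logits = policy_net(state)          # forward pass gives the distribution's parameters
dist = Categorical(logits=logits)
action = dist.sample()              # the sampling step itself is never differentiated

# Evaluate the log-PMF at the sampled action and scale by the return;
# the gradient flows through log pi_theta(a|s), not through the sample.
loss = -(G_t * dist.log_prob(action)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```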

David
  • I think I am still very confused about something: let's say, for a non-actor-critic policy gradient method, we already have a stochastic policy. I don't understand how the policy gradient is calculated w.r.t. the outcome, since the outcome is the realisation of just one possible action from the distribution. In other words, from a particular outcome, an outside observer wouldn't be able to tell that this is a stochastic policy. How do we compute the gradient w.r.t. this outcome then? – Sam Apr 04 '23 at 16:28
  • If the answer is Monte Carlo, then even if the stochastic policy is a discrete categorical distribution, it doesn't matter anymore, does it? – Sam Apr 04 '23 at 16:59
  • It's not totally clear to me what you mean. If not using a PG method that requires sampling from the policy, the alternatives are value learning (we don't explicitly look at the policy here, just learn a Q-function for, usually, the greedy policy), or alternatively a PG algorithm that originates from the REINFORCE update. Here, whilst we sample from the policy, we don't need to differentiate through the sampled action. We just maximise the likelihood (under the policy) of taking the action we saw, scaled by the returns. – David Apr 04 '23 at 21:25
  • 'Here, whilst we sample from the policy, we don't need to differentiate through the sampled action. We just maximise the likelihood (under the policy) of taking the action we saw, scaled by the returns.' But to update the model parameters that parameterise the policy, we still need gradients, don't we? – Sam Apr 05 '23 at 09:10
  • Yes, but you don't need to differentiate through the sampled action. You just do a forward pass of your network to output the parameters of your distribution, and then evaluate the pdf/pmf at the action you chose and differentiate the output (remembering to also scale by the returns). – David Apr 05 '23 at 09:14