I am trying to understand and reproduce the Proximal Policy Optimization (PPO) algorithm in detail. One thing that I find missing in the paper introducing the algorithm is how exactly actions $a_t$ are generated given the policy network $\pi_\theta(a_t|s_t)$.

From the source code, I saw that discrete actions are sampled from a probability distribution (which I assume to be a discrete/categorical one in this case) parameterized by the output probabilities that $\pi_\theta$ produces for the state $s_t$.

However, what I don't understand is how continuous actions are sampled/generated from the policy network. Are they also sampled from a (presumably continuous) distribution? If so, which type of distribution is used, and which parameters does the policy network predict to parameterize it?

Also, is there any official literature that I could cite which introduces the method by which PPO generates its action outputs?

Daniel B.
  • If you have a new question, you should ask it in a different post, even though it's related to the current question (I actually don't know), because that may invalidate the existing answers. – nbro Dec 16 '20 at 19:35
  • I think the answers I am searching for fundamentally belong together, since knowing what to predict does not make sense without knowing how to eventually arrive at that prediction. And to get there, the surrogate loss needs to be considered as well, since otherwise you don't have any way to properly train the model (in spite of knowing what it should predict). Just for context: $r_t(\theta)$ is an important part of the surrogate loss. But anyway, I sort of see your point, so let's make it a separate question then. – Daniel B. Dec 16 '20 at 20:44
  • Never mind. I think this [question](https://ai.stackexchange.com/q/23276/37982) actually contains the answer to my edited question. I only hope that it is correct, because the suspected answer is phrased as part of a question. But it sounds like a reasonable approach. Sorry for the confusion; I will 'revert' the edit. – Daniel B. Dec 16 '20 at 20:54
  • If you're not satisfied with that answer, eventually, you could ask a similar question but make sure to provide the context and say why you're not satisfied with that answer. – nbro Dec 16 '20 at 21:16

1 Answer

As long as your policy (propensity) is differentiable, everything is fine. Discrete, continuous, or something else, it doesn't matter! :)

A common example for continuous spaces is the reparameterization trick, where your policy outputs $\mu, \sigma = \pi(s)$ and the action is $a \sim \mathcal{N}(\mu, \sigma)$.
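
A minimal sketch of what such a Gaussian policy head could look like (PyTorch-style; the class name, layer sizes, and the state-independent log-std are illustrative choices, not something prescribed by the PPO paper):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class GaussianPolicy(nn.Module):
    """Illustrative continuous policy: predicts the mean of a diagonal Gaussian
    and keeps a state-independent log standard deviation as a learned parameter."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu_head = nn.Linear(hidden, act_dim)
        # A learned, state-independent log-std is one common choice;
        # predicting the std from the state is another.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mu = self.mu_head(self.body(obs))
        return Normal(mu, self.log_std.exp())


# Sampling an action and its log-probability for a (random) state:
policy = GaussianPolicy(obs_dim=8, act_dim=2)
dist = policy(torch.randn(8))
action = dist.sample()                    # a ~ N(mu(s), sigma)
log_prob = dist.log_prob(action).sum(-1)  # sum over independent action dimensions
```

The log-probability of the sampled action is what PPO needs later on, to form the ratio between the new and the old policy.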

kaiwenw
  • Thanks for your answer! So, as I see it, action values are indeed sampled from some distribution in both the continuous and the discrete case. In the continuous case, a Gaussian distribution is (always?) used, whose parameters are predicted. But what does the distribution look like in the discrete case? Is it true that, in the discrete case, the probabilities for sampling actions are simply given by (possibly non-Gaussian-like) probability vectors? And is there any nice paper that officially introduced these sampling schemes? – Daniel B. Dec 12 '20 at 14:03
  • In the discrete case, you can do epsilon-greedy, softmax, or anything really. As long as the density of your policy is differentiable, you can run PPO. – kaiwenw Dec 12 '20 at 21:19
  • So, after all, there's no such thing as a fixed output-layer architecture that a deep RL model would have to use in order to qualify as a PPO variant? It's really up to the researcher to decide what a PPO variant's policy network's output should look like? Just double-checking one more time. :) – Daniel B. Dec 13 '20 at 02:04
  • 1
    @DanielB. exactly! :) the essence of REINFORCE, PPO, TRPO, Q-learning are the *way* the actors are updated, rather than a specific deep network architecture. For example, PPO/TRPO tries to stay in a "Trust Region", regardless of what policy architecture you choose. – kaiwenw Dec 13 '20 at 05:59
  • Ok, thank you very much! I think I am just subconsciously a bit biased towards believing that one always has to have a certain kind of output-layer architecture. I previously worked with Q-learning, where it is pretty much dictated that, however the output layer looks, it must be able to predict arbitrary Q-values (which of course makes the use of certain activation functions etc. impractical) and where the meaning of the output is clearly predetermined (to be Q-values), whereas the interpretation of the outputs in PPO, TRPO, ... seems to be much more flexible. – Daniel B. Dec 13 '20 at 10:56
  • 1
    In the continuous case, how would the probability $\pi_\theta(a_t|s_t)$ be computed since we don't predict probability vectors any longer, but unconstrained real numbers instead? Just asking because we still need this to be able to compute the probability ratio $r(\theta)$. – Daniel B. Dec 14 '20 at 12:06
  • @DanielB. It seems that in the continuous case one has to use the re-parameterization trick or other methods to deal with probability distributions. I'm not sure if the re-parameterization trick allows multi-modal probability distributions to be represented. This seems to be a complicated topic. – Yan King Yin May 04 '23 at 11:11
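
Regarding the question in the comments about how $\pi_\theta(a_t|s_t)$ is obtained when the network outputs unconstrained real numbers: for a Gaussian policy it is simply the density of the taken action under the predicted $\mathcal{N}(\mu, \sigma)$, and the ratio $r_t(\theta)$ is the exponentiated difference of log-probabilities under the new and the old parameters. A minimal sketch under those assumptions (PyTorch-style, with made-up numbers; the discrete case works the same way with a categorical distribution over logits):

```python
import torch
from torch.distributions import Categorical, Normal

# Continuous case: pi_theta(a|s) is the Gaussian density of the action actually taken.
dist_old = Normal(torch.tensor([0.0, -0.2]), torch.tensor([0.6, 0.6]))  # rollout policy
dist_new = Normal(torch.tensor([0.1, -0.3]), torch.tensor([0.5, 0.5]))  # current policy
action = dist_old.sample()                               # action taken during data collection
log_ratio = dist_new.log_prob(action).sum(-1) - dist_old.log_prob(action).sum(-1)
ratio = log_ratio.exp()                                  # r(theta) for this transition

# Discrete case: the network outputs logits; actions are sampled from a Categorical.
logits_old = torch.tensor([0.8, 0.1, -0.4])
logits_new = torch.tensor([1.0, 0.2, -0.5])
a = Categorical(logits=logits_old).sample()
ratio_discrete = (Categorical(logits=logits_new).log_prob(a)
                  - Categorical(logits=logits_old).log_prob(a)).exp()
```

Working with log-probabilities and exponentiating their difference is the numerically stable way to obtain the ratio in both cases.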