
When trying to implement my own PPO (Proximal Policy Optimization), I came across two different implementations:

Exploration with std

  1. Collect trajectories over $N$ timesteps, using a distribution centered on the policy output, with a progressively trained std (standard deviation) variable for exploration
  2. Train the policy function for $K$ steps
  3. Train the value function for $K$ steps

For example, OpenAI's implementation.

Exploration with entropy

  1. Collect trajectories over $N$ timesteps, using the policy function directly
  2. Train the policy and value functions at the same time for $K$ steps, with a common loss for the two models and an additional entropy bonus for exploration purposes.

For example, the PPO algorithm as described in the official paper.
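
To make the contrast concrete, here is a rough sketch of the two update structures I have in mind (PyTorch-style; the loss functions are placeholders standing in for the real PPO objectives, not code from either implementation):

```python
import torch

# Toy stand-ins for the real networks and PPO losses, just to show the update structure.
policy_net = torch.nn.Linear(4, 2)   # would output action means
value_net = torch.nn.Linear(4, 1)    # would output state values
states = torch.randn(8, 4)           # dummy batch of collected states
K = 10                               # optimization steps per collected batch

def policy_loss():   # placeholder for the clipped surrogate objective
    return policy_net(states).pow(2).mean()

def value_loss():    # placeholder for the value-function regression loss
    return value_net(states).pow(2).mean()

def entropy():       # placeholder for the mean policy entropy
    return torch.tensor(0.0)

# Variant 1: train policy and value function in separate K-step loops.
pi_opt = torch.optim.Adam(policy_net.parameters(), lr=3e-4)
v_opt = torch.optim.Adam(value_net.parameters(), lr=3e-4)
for _ in range(K):
    pi_opt.zero_grad()
    policy_loss().backward()
    pi_opt.step()
for _ in range(K):
    v_opt.zero_grad()
    value_loss().backward()
    v_opt.step()

# Variant 2: one combined loss over both models, with an entropy bonus.
joint_opt = torch.optim.Adam(
    list(policy_net.parameters()) + list(value_net.parameters()), lr=3e-4)
for _ in range(K):
    loss = policy_loss() + 0.5 * value_loss() - 0.01 * entropy()
    joint_opt.zero_grad()
    loss.backward()
    joint_opt.step()
```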

What are the pros/cons of these two algorithms?

Is this specific to PPO, or is this a classic question concerning policy gradient algorithms in general?

Loheek

1 Answer


Both implementations may be closer than you think.

In short:

PPO has both parts: there is noisiness in the draws during training (with a learned standard deviation), which helps explore new, promising actions/policies. And there is a term added to the loss function that aims to prevent a collapse of that noisiness, to help ensure exploration continues and we don't get stuck at a bad (local) equilibrium.

In fact, for continuous actions, the entropy term in the loss function that you describe in Example 2 only makes sense when the actions are stochastic, i.e. when the action choice has some standard deviation, the way you describe in Example 1.

More detail:

On the one hand, PPO (at least for continuous actions) trains a central/deterministic value (say, the mean policy, or something close to it) targeting the most profitable action path. On the other hand, along with it, it trains a standard deviation that turns the actions into random draws with noise around that deterministic value. This is the part you describe in Example 1. The noise helps explore new paths and update the policy according to the rewards obtained on these sampled paths. Entropy itself is a measure of the noisiness of the draws, and thus also an indirect indicator of the trained standard deviation value(s) of the policy.
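
As a minimal sketch of this (assuming a diagonal Gaussian policy in PyTorch; the network and sizes are made up, not taken from any particular implementation):

```python
import torch
from torch.distributions import Normal

# Minimal diagonal-Gaussian policy head: a deterministic mean plus a learned log-std.
mean_net = torch.nn.Linear(4, 2)              # maps state features to action means
log_std = torch.nn.Parameter(torch.zeros(2))  # trained std, stored as log for positivity

state = torch.randn(1, 4)                     # dummy state
dist = Normal(mean_net(state), log_std.exp()) # noise around the deterministic mean

action = dist.sample()                        # noisy draw used for exploration
log_prob = dist.log_prob(action).sum(-1)      # used later in the PPO probability ratio
entropy = dist.entropy().sum(-1)              # grows with the std: more noise, more entropy
```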

Now, entropy tends to decay as training progresses, that is, the random draws become progressively less random. This can be good for reward maximization - only the best-estimated draws are taken - but it is bad for further improvement of the policy: improvement may halt or slow down as exploration of new action paths fades.

This is where entropy encouragement comes in. PPO provides for the inclusion of entropy in the loss function: we reduce the loss by x * entropy, with x the entropy coefficient (e.g. 0.01), incentivizing the network to increase the standard deviations (or at least not to let them drop too much). This part is what you describe in Example 2.
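
A small numeric sketch of that loss (the coefficient values are only examples; 0.5 for the value term is a common but not universal choice):

```python
import torch

# Illustrative numbers standing in for batch statistics in a real PPO update.
policy_loss = torch.tensor(0.2)   # clipped surrogate objective (to be minimized)
value_loss = torch.tensor(1.3)    # value-function regression error
entropy = torch.tensor(2.8)       # mean policy entropy over the batch

entropy_coef = 0.01               # the "x" above
value_coef = 0.5                  # a common weight for the value term

# Higher entropy lowers the total loss, so the optimizer is nudged to keep
# the action noise (the stds) from collapsing too early.
total_loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
print(total_loss)                 # tensor(0.8220)
```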

Further notes:

  • During exploitation, we'd typically turn off the noise (implicitly assuming an action std of 0) and pick deterministic actions: in normal cases this increases the payoffs, since we're choosing our best action estimate rather than a random value around it (see the sketch after this list).

  • People are not always precise when referring to the model's entropy vs. the entropy coefficient added to the loss function.

  • Other RL algorithms with continuous actions tend to use noisy draws with standard deviations/entropy too.
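
To illustrate the first point above, a small sketch using the same hypothetical Gaussian policy as before:

```python
import torch
from torch.distributions import Normal

mean_net = torch.nn.Linear(4, 2)              # hypothetical policy mean network, as above
log_std = torch.nn.Parameter(torch.zeros(2))  # learned std
state = torch.randn(1, 4)

# Training / exploration: sample a noisy action around the mean.
train_action = Normal(mean_net(state), log_std.exp()).sample()

# Exploitation / evaluation: skip the noise and act on the mean directly.
eval_action = mean_net(state)
```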

FlorianH