I'm implementing Actor-Critic architectures for reinforcement learning (RL), and I've noticed that it is common practice to limit the standard deviation (std) of the policy distribution. I've seen this in several implementations, specifically in two Actor classes, which I'll refer to as ActorOne (from link) and ActorTwo (from link).
Both implementations restrict the standard deviation, but in different ways. In ActorOne, the output of a Linear layer is passed through tanh to squash it into [-1, 1] and then rescaled linearly into [LOG_SIG_MIN, LOG_SIG_MAX] to obtain the final log_std. In ActorTwo, the output of a Linear layer is clamped directly to [LOG_SIG_MIN, LOG_SIG_MAX] to obtain the final log_std.
Here's a snippet from the ActorOne class:
log_std = linear_layer(h)      # raw, unbounded output of the log_std head
log_std = torch.tanh(log_std)  # squash into [-1, 1]
log_std = LOG_SIG_MIN + 0.5 * (LOG_SIG_MAX - LOG_SIG_MIN) * (log_std + 1)  # rescale into [LOG_SIG_MIN, LOG_SIG_MAX]
And here's a snippet from the ActorTwo class:
log_std = linear_layer(h)                                  # raw, unbounded output of the log_std head
log_std = torch.clamp(log_std, LOG_SIG_MIN, LOG_SIG_MAX)   # hard clip into [LOG_SIG_MIN, LOG_SIG_MAX]
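For reference, here is how I understand the two variants end to end. This is just my own minimal sketch; the layer sizes and the LOG_SIG_MIN/LOG_SIG_MAX values of -20 and 2 are assumptions I made for illustration, not values taken from either repo:

import torch
import torch.nn as nn

LOG_SIG_MIN, LOG_SIG_MAX = -20.0, 2.0  # assumed bounds, not from either repo
linear_layer = nn.Linear(8, 4)         # assumed sizes
h = torch.randn(1, 8)                  # stand-in for the hidden features

raw = linear_layer(h)

# ActorOne-style: squash with tanh, then rescale into [LOG_SIG_MIN, LOG_SIG_MAX]
log_std_one = LOG_SIG_MIN + 0.5 * (LOG_SIG_MAX - LOG_SIG_MIN) * (torch.tanh(raw) + 1)

# ActorTwo-style: hard clip into the same interval
log_std_two = torch.clamp(raw, LOG_SIG_MIN, LOG_SIG_MAX)

# In both cases the policy then uses std = exp(log_std) for its Gaussian
std = log_std_one.exp()
mean = torch.zeros_like(std)           # placeholder mean head
dist = torch.distributions.Normal(mean, std)
action = dist.rsample()                # reparameterized sample

Both variants end up in the same interval; the difference I'm asking about below is only in how the gradient behaves near and outside the bounds.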
I have two main questions about this:
What is the purpose of limiting the standard deviation in these implementations? Could there be any practical implications or potential issues if the standard deviation is not limited?
What are the differences between these two methods? From what I understand, the tanh-based approach always preserves a (possibly very small) gradient, whereas with clamp any value that falls outside [LOG_SIG_MIN, LOG_SIG_MAX] receives zero gradient; I try to verify this in the small check below. What are the strengths and weaknesses of each method, and how should I choose between them?
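To check my claim about the gradients, I wrote this toy experiment. The bound values and the out-of-range input 5.0 are placeholders I picked myself, not taken from either repo:

import torch

LOG_SIG_MIN, LOG_SIG_MAX = -20.0, 2.0  # placeholder bounds for the test

# A raw log_std value that lies outside the allowed range
raw = torch.tensor(5.0, requires_grad=True)

# tanh + rescale: the gradient still flows, just heavily attenuated in the saturated region
out_tanh = LOG_SIG_MIN + 0.5 * (LOG_SIG_MAX - LOG_SIG_MIN) * (torch.tanh(raw) + 1)
out_tanh.backward()
print(raw.grad)  # small but non-zero

raw.grad = None  # reset before the second test

# clamp: the gradient is exactly zero once the input is outside the range
out_clamp = torch.clamp(raw, LOG_SIG_MIN, LOG_SIG_MAX)
out_clamp.backward()
print(raw.grad)  # tensor(0.)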
Any insights or recommendations on this topic would be greatly appreciated.