I'm implementing Actor-Critic architectures for reinforcement learning (RL), and I've noticed that it is common practice to limit the standard deviation (std) of the policy distribution. I've seen this in several implementations, specifically in two Actor classes, which I'll refer to as ActorOne (from link) and ActorTwo (from link).
Both implementations restrict the standard deviation, but in different ways. In ActorOne, the output of a Linear layer is passed through tanh to squash it into [-1, 1] and then rescaled linearly into [LOG_SIG_MIN, LOG_SIG_MAX] to obtain the final log_std. In ActorTwo, the output of a Linear layer is clamped directly to [LOG_SIG_MIN, LOG_SIG_MAX] to obtain the final log_std.
Here's a snippet from the ActorOne class:
log_std = linear_layer(h)      # raw, unbounded output of the log_std head
log_std = torch.tanh(log_std)  # squash into [-1, 1]
log_std = LOG_SIG_MIN + 0.5 * (LOG_SIG_MAX - LOG_SIG_MIN) * (log_std + 1)  # rescale into [LOG_SIG_MIN, LOG_SIG_MAX]
And here's a snippet from the ActorTwo class:
log_std = linear_layer(h)                                  # raw, unbounded output of the log_std head
log_std = torch.clamp(log_std, LOG_SIG_MIN, LOG_SIG_MAX)   # hard clip into [LOG_SIG_MIN, LOG_SIG_MAX]
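For reference, here is how I understand the two variants end to end. This is just my own minimal sketch; the layer sizes and the LOG_SIG_MIN/LOG_SIG_MAX values of -20 and 2 are assumptions I made for illustration, not values taken from either repo:

import torch
import torch.nn as nn

LOG_SIG_MIN, LOG_SIG_MAX = -20.0, 2.0  # assumed bounds, not from either repo
linear_layer = nn.Linear(8, 4)         # assumed sizes
h = torch.randn(1, 8)                  # stand-in for the hidden features

raw = linear_layer(h)

# ActorOne-style: squash with tanh, then rescale into [LOG_SIG_MIN, LOG_SIG_MAX]
log_std_one = LOG_SIG_MIN + 0.5 * (LOG_SIG_MAX - LOG_SIG_MIN) * (torch.tanh(raw) + 1)

# ActorTwo-style: hard clip into the same interval
log_std_two = torch.clamp(raw, LOG_SIG_MIN, LOG_SIG_MAX)

# In both cases the policy then uses std = exp(log_std) for its Gaussian
std = log_std_one.exp()
mean = torch.zeros_like(std)           # placeholder mean head
dist = torch.distributions.Normal(mean, std)
action = dist.rsample()                # reparameterized sample

Both variants end up in the same interval; the difference I'm asking about below is only in how the gradient behaves near and outside the bounds.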
I have two main questions about this:
What is the purpose of limiting the standard deviation in these implementations? Could there be any practical implications or potential issues if the standard deviation is not limited?
What are the differences between these two methods? From what I understand, the tanh-based approach always preserves a (possibly very small) gradient, whereas with clamp any value that falls outside [LOG_SIG_MIN, LOG_SIG_MAX] receives zero gradient; I try to verify this in the small check below. What are the strengths and weaknesses of each method, and how should I choose between them?
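To check my claim about the gradients, I wrote this toy experiment. The bound values and the out-of-range input 5.0 are placeholders I picked myself, not taken from either repo:

import torch

LOG_SIG_MIN, LOG_SIG_MAX = -20.0, 2.0  # placeholder bounds for the test

# A raw log_std value that lies outside the allowed range
raw = torch.tensor(5.0, requires_grad=True)

# tanh + rescale: the gradient still flows, just heavily attenuated in the saturated region
out_tanh = LOG_SIG_MIN + 0.5 * (LOG_SIG_MAX - LOG_SIG_MIN) * (torch.tanh(raw) + 1)
out_tanh.backward()
print(raw.grad)  # small but non-zero

raw.grad = None  # reset before the second test

# clamp: the gradient is exactly zero once the input is outside the range
out_clamp = torch.clamp(raw, LOG_SIG_MIN, LOG_SIG_MAX)
out_clamp.backward()
print(raw.grad)  # tensor(0.)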
Any insights or recommendations on this topic would be greatly appreciated.