
I was reading some tips & tricks for training in DRL here, and I noticed the following:

  • always normalize your observation space when you can, i.e., when you know the boundaries
  • normalize your action space and make it symmetric when continuous (cf. potential issue below). A good practice is to rescale your actions to lie in [-1, 1]. This does not limit you, as you can easily rescale the action inside the environment

I am working on a discrete action space, but it is quite difficult to normalize my states when I don't actually know the full range of each feature (only an estimate).
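For what it's worth, this is roughly what I do at the moment (a sketch; low_est and high_est are only my estimated feature ranges, not the true bounds):

```python
import numpy as np

# Estimated (not exact) per-feature ranges of my observation space.
low_est = np.array([0.0, -5.0, 0.0])
high_est = np.array([10.0, 5.0, 1.0])

def normalize_obs(obs):
    """Rescale an observation to roughly [-1, 1] using the estimated
    ranges, clipping anything that falls outside the estimate."""
    scaled = 2.0 * (obs - low_est) / (high_est - low_est) - 1.0
    return np.clip(scaled, -1.0, 1.0)
```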

How does this affect training? And, more specifically, why do we also need to normalize the action values in continuous action spaces?

mkanakis

1 Answer


These two tips/tricks are useful precisely because we are in the context of deep reinforcement learning, as you pointed out. In DRL, the RL algorithm is guided in some fashion by a deep neural network, and the reasons for normalizing stem from the gradient descent algorithm and the architecture of the network.

How does this affect training?

An observation from the observation space is typically the input to a neural network in DRL algorithms, and normalizing the inputs of a neural network is beneficial for many reasons (e.g. it speeds up convergence, helps numerical precision, prevents the parameters from diverging, and makes hyperparameter tuning easier). These are standard results in DL theory and practice, so I won't provide details here.
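As a rough illustration (a sketch, not any particular library's implementation) of how this is often handled when the exact bounds are unknown, you can keep running statistics of the observations and standardize each one before it is fed to the network:

```python
import numpy as np

class RunningObsNormalizer:
    """Keeps a running mean/variance of observations (Welford-style) and
    standardizes each observation before it goes into the network."""

    def __init__(self, obs_dim, clip=10.0):
        self.mean = np.zeros(obs_dim)
        self.var = np.ones(obs_dim)
        self.count = 0
        self.clip = clip

    def update(self, obs):
        self.count += 1
        delta = obs - self.mean
        self.mean = self.mean + delta / self.count
        # Incremental update of the (population) variance.
        self.var = self.var + (delta * (obs - self.mean) - self.var) / self.count

    def normalize(self, obs):
        self.update(obs)
        return np.clip((obs - self.mean) / (np.sqrt(self.var) + 1e-8),
                       -self.clip, self.clip)
```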

And more specifically, why on continuous action spaces we need to normalize also the action's values?

Most popular discrete-action-space DRL algorithms (e.g. DQN) have one output node in the neural network for each possible action. The value of an output node may be a Q-value (value-based algorithms) or the probability of taking that action (policy-based algorithms).
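For instance, a minimal DQN-style Q-network might look like the following sketch (PyTorch is assumed here, and the layer sizes and dimensions are arbitrary placeholders), with exactly one output node per discrete action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps an observation to one Q-value per discrete action."""

    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),  # one output node per action
        )

    def forward(self, obs):
        return self.net(obs)

# Greedy action selection: pick the action whose output node is largest.
q_net = QNetwork(obs_dim=4, n_actions=2)
obs = torch.rand(1, 4)
action = q_net(obs).argmax(dim=1).item()
```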

In contrast, a continuous-action-space DRL algorithm cannot have an output node for each possible action, since there are uncountably many of them. The output is instead usually the actual action to be taken by the agent, or some parameters used to construct that action (e.g. PPO outputs a mean and a standard deviation, and an action is then sampled from the corresponding Gaussian distribution; this is mentioned in your linked reference). Therefore, normalizing the action space of a continuous-control DRL algorithm is analogous to normalizing the outputs of the corresponding neural network, which is known to increase training speed and prevent divergence. Again, a quick search will yield good resources if you are interested in these results.
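As a hypothetical illustration of the second tip you quoted: the policy network only ever produces actions in [-1, 1], and the environment (or a wrapper around it) rescales them to the true action bounds, so the network's outputs stay well scaled regardless of the actual units of the action:

```python
import numpy as np

def rescale_action(norm_action, low, high):
    """Map a normalized action in [-1, 1] to the environment's true
    action range [low, high] (done inside the env, not the network)."""
    norm_action = np.clip(norm_action, -1.0, 1.0)
    return low + 0.5 * (norm_action + 1.0) * (high - low)

# e.g. a torque actuator whose true bounds are [-2, 2]:
low, high = np.array([-2.0]), np.array([2.0])
print(rescale_action(np.array([0.5]), low, high))  # -> [1.0]
```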

DeepQZero