14

In OpenAI's actor-critic and in OpenAI's REINFORCE examples, the rewards are normalized like so

rewards = (rewards - rewards.mean()) / (rewards.std() + eps)

on every episode individually.
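For context, here is a minimal sketch of where such per-episode standardization typically sits in a REINFORCE-style update (the names policy_update, log_probs, returns, and optimizer are illustrative, not taken verbatim from the linked examples):

    import torch

    eps = 1e-8  # small constant to avoid division by zero

    def policy_update(log_probs, returns, optimizer):
        # log_probs: list of log pi(a_t | s_t) tensors collected during one episode
        # returns:   list of discounted returns G_t, one per time step
        returns = torch.tensor(returns)
        # the per-episode standardization the question asks about
        returns = (returns - returns.mean()) / (returns.std() + eps)
        # policy-gradient loss: -sum_t log pi(a_t | s_t) * G_t
        loss = torch.stack([-lp * g for lp, g in zip(log_probs, returns)]).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()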

This is probably the baseline reduction, but I'm not entirely sure why they divide by the standard deviation of the rewards?

Assuming this is the baseline reduction, why is this done per episode?

What if one episode yields rewards in the (absolute, not normalized) range of $[0, 1]$, and the next episode yields rewards in the range of $[100, 200]$?

This method seems to ignore the absolute difference between the episodes' rewards.

nbro
Gulzar

3 Answers

8

The "trick" of subtracting a (state-dependent) baseline from the $Q(s, a)$ term in policy gradients to reduce variants (which is what is described in your "baseline reduction" link) is a different trick from the modifications to the rewards that you are asking about. The baseline subtraction trick for variance reduction does not appear to be present in the code you linked to.

The thing that your question is about appears to be standardization of rewards, as described in Brale_'s answer, to put all the observed rewards in a similar range of values. Such a standardization procedure inherently requires division by the standard deviation, so... that answers that part of your question.

As for why they are doing this on a per-episode basis... I think you're right, in the general case this seems like a bad idea. If there are rare events with extremely high rewards that only occur in some episodes, and the majority of episodes only experience common events with lower-scale rewards... yes, this trick will likely mess up training.

In the specific case of the CartPole environment (which is the only environment used in these two examples), this is not a concern. In this implementation of the CartPole environment, the agent simply receives a reward with a value of exactly $1$ for every single time step in which it manages to "survive". The rewards list in the example code is, in my opinion, poorly named, because it actually contains the discounted returns for the different time steps, which look like $G_t = \sum_{k=t}^{T} \gamma^{k-t} R_k$, where all the individual $R_k$ values are equal to $1$ in this particular environment.
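For concreteness, such a list of returns is usually built backwards from the per-step rewards, roughly like the following sketch (the function name discounted_returns is made up; this is not the exact code from the examples):

    def discounted_returns(rewards, gamma=0.99):
        # G_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ...
        returns = []
        g = 0.0
        for r in reversed(rewards):  # work backwards from the final time step
            g = r + gamma * g
            returns.insert(0, g)
        return returns

    # In CartPole every R_t is 1, so the returns simply decay geometrically:
    # discounted_returns([1, 1, 1], gamma=0.9) -> [2.71, 1.9, 1.0]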

These kinds of values tend to be in a fairly consistent range (especially if the policy used to generate them also only moves slowly), so the standardization that they do may be relatively safe, and may improve learning stability and/or speed (by making sure there are always roughly as many actions for which the probability gets increased as there are actions for which the probability gets decreased, and possibly by making hyperparameters easier to tune).

It does not seem to me like this trick would generalize well to many other environments, and personally I think it shouldn't be included in such a tutorial / example.


Note: I'm quite sure that a per-episode subtraction of the mean returns would be a valid, albeit possibly unusual, baseline for variance reduction. It's the subsequent division by standard deviation that seems particularly problematic to me in terms of generalizing to many different environments.
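If one wants a cheap per-episode baseline without the problematic scale change, the mean-only variant would look roughly like this (a sketch; the function name is made up):

    import torch

    def subtract_mean_baseline(returns):
        returns = torch.tensor(returns)
        # centre the returns, but keep their original scale:
        return returns - returns.mean()  # no division by the standard deviation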

Dennis Soemers
  • What about normalizing each reward w.r.t. ALL previous rewards, rather than just the last episode's? Would this make more sense, or still cause the same kind of problems? Thanks! (A running-statistics sketch of this idea follows these comments.) – Gulzar Jan 26 '19 at 15:41
  • 1
    @Gulzar Intuitively I think that'd be fine. Eventually (likely quite soon) it'd "settle down" and you'd consistently be applying the same subtraction + division everywhere. At that point, I'm quite sure you could view the division as simply being an adaptation of your learning rate, which is clearly a hyperparameter that can be adjusted somewhat freely. I don't personally have experience doing this though, and didn't work out the math... which is what you'd want to do if you want to be 100% sure. – Dennis Soemers Jan 26 '19 at 15:53
  • @DennisSoemers Can we say that the per-episode return is a baseline? Does it really depend on "state" only? If not, it is not a valid baseline, and hence biased. Consider the first state, $s_0$; it has a return $g_0$. A valid baseline must only depend on $s_0$ and not $a_0$. Consider the mean return, $(g_0 + g_1 + \dots + g_T)/T$. It clearly has components of future rewards, which clearly depend on $a_0$. Hence, I don't think it is a valid baseline. – Phizaz Oct 23 '19 at 14:30
  • @Phizaz The "states" that the baseline is allowed to depend on, and the "actions" that the baseline should not depend on, are the states and actions "inside" the Expectation operator that we have in the expression for the gradient of the objective (see the openai link at the end of my answer). Technically such an empirical mean return from an episode doesn't depend on any of those states or actions inside the expectation, it's more like a constant number (not even a function of $s_t$). – Dennis Soemers Oct 24 '19 at 15:02
  • A small part of the math in the grad of the objective is something (simplified for brevity) that looks like $\mathbb{E}_{a_t \sim \pi} \left[ \nabla \log \pi(a_t \mid s_t) G_t \right]$. You could rewrite that expectation as summing over all actions, and multiplying the stuff inside the expectation by the probability $\pi(a_t \mid s_t)$. You can choose to subtract another quantity $b$ from $G_t$, but what's important is that in every possible "case" for $a_t$, you subtract the same quantity. – Dennis Soemers Oct 24 '19 at 15:07
  • @DennisSoemers I'm having a hard time chewing your argument that "empirical mean return" is a constant (not a function of s). It is clearly a random variable that depends on "action" and future actions. How could it be a constant? – Phizaz Oct 25 '19 at 00:27
  • @Phizaz See the derivation at http://tiny.cc/69b5ez under the "Understanding the Baseline" header, which shows why a baseline $b(s_t)$ does not introduce bias. At the right-hand side of the first line, we have an expectation $\mathbb{E}_{s_{t+1:T},a_{t:T-1}}$, i.e. an expectation under the assumption that everything AFTER reaching a state $s_t$ is sampled from the behaviour $\pi$. The baseline $b(s_t)$ is inside this expectation, and the proof works only if we're allowed to move it out of the expectation. – Dennis Soemers Oct 25 '19 at 08:28
  • You can think of that expectation as integrating/summing over all the possible "what-if we selected this $a_t$" scenarios, weighted by how likely $\pi$ is to select any particular $a_t$. Our mean empirical results were indeed determined by what happened in future timesteps in one concrete episode, but once we've arrived at the stage of evaluating (or approximating) this expectation, we do not pick different baseline values for all the different what-if cases being considered in the expectation. We use the same baseline value $b$ for every case $a_t$ being considered by the expectation/integral – Dennis Soemers Oct 25 '19 at 08:31
  • Coming back here after a while... How would you deal with multi-scale rewards [where the larger-scale rewards occur much more rarely]? – Gulzar Feb 27 '21 at 12:46
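Regarding the running-statistics idea from the first comment above, a minimal sketch might look like this (purely illustrative; the class and method names are made up):

    import math

    class RunningNormalizer:
        # Normalize returns against a running mean/std over ALL returns seen so
        # far (Welford's online algorithm) instead of per-episode statistics.
        def __init__(self, eps=1e-8):
            self.count = 0
            self.mean = 0.0
            self.m2 = 0.0   # running sum of squared deviations from the mean
            self.eps = eps  # avoids division by zero early on

        def update(self, x):
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

        def normalize(self, x):
            std = math.sqrt(self.m2 / max(self.count - 1, 1))
            return (x - self.mean) / (std + self.eps)

    # Usage: call update(g) on every observed return, then normalize new returns
    # against the accumulated statistics rather than the current episode's own.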
6

This question is discussed in detail in the following NeurIPS 2016 paper by Hado van Hasselt et al. (with David Silver as a co-author): Learning values across many orders of magnitude. They also give experimental results over the Atari domain.

user26209
4

We subtract the mean from the values and divide by the standard deviation to get data with a mean of zero and a variance of one. The range of values per episode does not matter; the result will always have zero mean and unit variance. If the range is bigger ($[100, 200]$), then the standard deviation will be bigger as well than for a smaller range ($[0, 1]$), so we end up dividing by a bigger number.
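As an illustration, with arbitrarily chosen values in the two ranges from the question:

    import torch

    eps = 1e-8
    for rewards in (torch.tensor([0.0, 0.5, 1.0]), torch.tensor([100.0, 150.0, 200.0])):
        standardized = (rewards - rewards.mean()) / (rewards.std() + eps)
        print(standardized)  # both give approximately [-1, 0, 1] despite the very different scales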

Brale
  • Thanks for the reply! I still don't understand though - in an episode with objectively large rewards, my assumption is that we want to keep all the actions, rather than in an episode with small rewards. I get that normalization will put them all in about the same range, but I don't understand why this is a good thing for very different episodes. – Gulzar Jan 25 '19 at 18:46
  • 1
    I agree that it's debatable whether it's useful to apply such scaling to rewards in reinforcement learning. It makes intuitive sense to apply bigger steps in the direction of the gradient when the rewards are bigger rather than smaller; with scaling we potentially lose such information. On the other hand, such scaling can help the stability of the learning process, especially when dealing with function approximators such as neural networks. Policy gradient methods are known to change their performance drastically (negatively) when applying too big of a gradient step, and scaling eliminates such a possibility. – Brale Jan 25 '19 at 19:34
  • 1
    (continuing comment.) With scaling you could possibly not achieve the most optimal performance, but convergence of learning would be almost guaranteed. Also, in practice it is probably not common that you have such big changes in reward ranges from episode to episode. One possibility is also, if you know the entire reward range for the problem in advance, to a priori make a scaling function that scales it to a certain smaller range like [-1, 1]. – Brale Jan 25 '19 at 19:34
  • The network (on my own environment) actually does not converge (and does converge on CartPole), and I thought it may have something to do with that scaling factor. I actually have no idea how to debug in case the network doesn't converge, or what to do aside from increasing its size. Any thoughts? Thanks for your time! – Gulzar Jan 25 '19 at 19:45
  • Like Dennis said in his answer, it's possible that some good rare reward gets blended into the averageness of the other rewards, so you can try removing the scaling part. Also, maybe try training for a longer time? If none of that works, I'm sure you can find some other working implementation on GitHub and try with that. – Brale Jan 25 '19 at 20:48