
I'm working on a reinforcement learning task where I use reward shaping as proposed in the paper *Policy invariance under reward transformations: Theory and application to reward shaping* (1999) by Andrew Y. Ng, Daishi Harada and Stuart Russell.

In short, my reward function has this form:

$$R(s, s') = \gamma P(s') - P(s)$$

where $P$ is a potential function. When $s = s'$, then $R(s, s) = (\gamma - 1)P(s)$, which is non-positive, since $0 < \gamma \le 1$.

However, if $P(s)$ is relatively high (let's say $P(s) = 1000$), then $R(s, s)$ becomes large in magnitude as well (e.g. with $\gamma = 0.99$, $R(s, s) = -10$), and if the agent stays in the same state for many steps, the cumulative reward becomes more and more negative, which might affect the learning process.
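
To make the numbers concrete, here is a minimal sketch in plain Python (the potential and $\gamma$ values are just the ones from the example above):

```python
# Shaping reward as defined above: R(s, s') = gamma * P(s') - P(s)
def shaping_reward(p_s, p_s_next, gamma=0.99):
    return gamma * p_s_next - p_s

P_s = 1000.0                                              # a "high" potential
print(shaping_reward(P_s, P_s))                           # staying put: (0.99 - 1) * 1000 = -10.0
print(sum(shaping_reward(P_s, P_s) for _ in range(50)))   # 50 steps of staying put: -500.0
```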

In practice, I solved the problem by just removing the factor $P(s)$ when $s = s'$. But I have some doubts about the theoretical correctness of this "implementation trick".

Another idea could be to choose $\gamma$ appropriately, in order to give a reasonable reward. Indeed, with $\gamma = 1.0$ there is no problem at all, and with $\gamma$ very close to $1.0$ the negative reward is tolerable. Personally, I don't like this option, because it makes $\gamma$ somehow dependent on the reward.

What do you think?

2 Answers


I don't think the situation you're sketching should be a problem at all.

If $P(s)$ is high (e.g. $P(s) = 1000$), this means (according to your shaping / "heuristic") that it's valuable to be in the state $s$, i.e. that you expect to be able to get high future returns from that state.

If you then repeatedly take actions that keep you in the same state $s$, it is essentially correct to punish these actions (with negative rewards): you were expecting to get high future returns starting from that state, but you just remain stuck in it. The actions you're taking are not providing the returns you were expecting, so they should be "punished".

Of course, the exception to the above paragraph is the case where the transition from $s$ to $s$ (staying in the same state) does result in a positive "true reward" (the actual/real reward, not the shaping reward). In that case, if the true reward is sufficiently large, it will offset the reduction in potential and result in a neutral or positive combined reward.

As for the $\gamma$ parameter, a $\gamma < 1.0$ (such as $\gamma = 0.99$) intuitively means that you prioritize short-term rewards over long-term rewards (if the magnitudes of the undiscounted rewards are similar). This implies that what you wrote in the question is precisely what should happen; if you prioritize short-term rewards over long-term rewards, it's bad to "waste time" by staying in the same state, so actions that don't move you anywhere should be disincentivized through negative rewards. If $\gamma = 1.0$, you have no preference for short-term rewards over long-term rewards, so you'll be fine with "wasting time" by staying in the same state, and therefore no longer have to punish such actions.
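
To see why this is not a problem in theory: the discounted sum of the shaping rewards along any trajectory telescopes to $\gamma^T P(s_T) - P(s_0)$, so for bounded $P$ and $\gamma < 1$ the shaping contribution to the return is essentially a constant offset of $-P(s_0)$, regardless of how the agent behaves, which is exactly why it cannot change the optimal policy. A quick numerical check of the telescoping identity (with an arbitrary, made-up sequence of potentials):

```python
import random

def shaping_return(potentials, gamma):
    """Discounted sum of the shaping rewards gamma * P(s_{t+1}) - P(s_t)."""
    return sum(
        gamma ** t * (gamma * potentials[t + 1] - potentials[t])
        for t in range(len(potentials) - 1)
    )

gamma = 0.99
potentials = [random.uniform(0.0, 1000.0) for _ in range(200)]   # arbitrary P(s_0), ..., P(s_T)

T = len(potentials) - 1
telescoped = gamma ** T * potentials[-1] - potentials[0]
print(abs(shaping_return(potentials, gamma) - telescoped) < 1e-6)   # True (up to rounding)
```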

– Dennis Soemers

  • Thank you! So do you think this "numerical issue" is not an issue at all, from a theoretical point of view? I.e. convergence is still guaranteed, provided all the assumptions and premises? – Marco Favorito May 10 '18 at 10:20
  • @MarcoFavorito Correct. It seems to me like it's doing exactly what it should be doing. `gamma < 1.0` implies you want to punish "wasting time", and that's exactly what it's doing. – Dennis Soemers May 10 '18 at 10:37
  • @DennisSoemers, `gamma < 1.0` does not necessarily result in the potential-based term penalizing "wasting time." If the potential function is negative for all states, then "wasting time" will actually lead to a *positive* potential-based term. See my answer below for some intricacies/practical issues of potential-based reward shaping. – Brenden Petersen Jun 13 '18 at 22:25
  • @BrendenPetersen Sure, I was specifically considering the case of the question, where potentials were said to be high positive values. In the case you propose, with negative potentials, indeed you'd get positive terms for staying in the same state. That's again not a problem by definition though. `R(s, s)` might be positive for a negative potential `P(s)`, but `R(s, s')` for a state `s'` with a greater potential `P(s') > P(s)` will be **even more positive**, so that behaviour will still get reinforced more relatively. – Dennis Soemers Jun 14 '18 at 08:24

Dennis Soemers makes an important point: from a theoretical standpoint, this can be seen as a non-issue. However, what you bring up is an important practical issue of potential-based reward shaping (PBRS).

The issue is actually worse than you describe---it's more general than $s = s'$. In particular, the issue presents itself differently based on the sign of your potential function. For example, in your case it looks like the potential function is positive: $P(s) > 0$ for all $s$. The issue (as you have found) is that an increase in potential (regardless of whether $s = s'$) might not be enough to overcome the multiplication by $\gamma$, and thus the PBRS term may be negative. In particular, only when the fold-change in $P$ is large enough ($\frac{P(s')}{P(s)} > \frac{1}{\gamma}$) will the PBRS term actually be positive.

The situation changes when the potential function is negative, i.e. if $P(s) < 0$ for all $s$. In this case, you can actually get a positive PBRS signal even when there is a decrease in potential! In particular, only when the fold-change in $P$ is large enough (same inequality as before) will the PBRS term actually be negative.

To summarize, when $P > 0$, a decrease in potential will always lead to a negative PBRS term, but an increase must overcome a barrier due to $\gamma$ for the term to be positive. When $P < 0$, an increase in potential will always lead to a positive PBRS term, but a decrease must overcome a barrier due to $\gamma$ for the term to be negative.
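
A quick numerical illustration of both cases, with made-up potential values and $\gamma = 0.99$ (so the threshold fold-change is $1/\gamma \approx 1.0101$):

```python
gamma = 0.99

def shaping(p_s, p_s_next):
    return gamma * p_s_next - p_s

# Positive potentials: a small increase in potential is still punished
print(shaping(1000.0, 1005.0))    # about -5.05 (potential went up, term is negative)
print(shaping(1000.0, 1011.0))    # about +0.89 (fold-change 1.011 > 1/gamma)

# Negative potentials: a small decrease in potential is still rewarded
print(shaping(-1000.0, -1005.0))  # about +5.05 (potential went down, term is positive)
print(shaping(-1000.0, -1011.0))  # about -0.89 (fold-change 1.011 > 1/gamma)
```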

The intuition behind PBRS is that improving the potential function should be rewarded, and decreasing it should be penalized. However, it turns out that whether or not this holds true depends on things like 1) the sign of the potential function, 2) the fold-change in potential, and 3) the temporal resolution of your environment. For #3, if the temporal resolution of your environment can be altered such that an action brings you "partway" from $s$ to $s'$, then at some environment resolution you will run into one of the two problematic circumstances above. Another issue is that PBRS is highly sensitive to, for example, adding a constant to the potential function.
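
To make the last point explicit: shifting the potential by a constant $c$ shifts every single shaping reward by $(\gamma - 1)c$,

$$\gamma \big(P(s') + c\big) - \big(P(s) + c\big) = \big(\gamma P(s') - P(s)\big) + (\gamma - 1)c,$$

so with $\gamma = 0.99$, adding $c = 1000$ to the potential everywhere subtracts $10$ from every step's shaping reward.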

Another related issue is that whether some constant improvement to the potential function leads to a positive or negative reward depends on how far you are from the "goal" state. Potential functions are often chosen such that they estimate how good a state is (after all, the best choice of potential function is the optimal value function). Say we choose $\gamma = 0.99$ and that $P(s_{goal}) = 1000$ represents a goal state. Then increasing the potential by one from $P(s) = 900$ to $P(s') = 901$ yields a negative shaping reward of $-8.01$. In contrast, increasing the potential by one from $P(s) = 90$ to $P(s') = 91$ yields a small positive shaping reward of $+0.09$. So the sign of the PBRS term also depends on the distance from the goal.

This paper has some interesting examples and outlines many of the issues above.

From my own experience, this is a large practical issue. The LunarLanderContinuous-v2 environment from OpenAI Gym includes a PBRS term, but they exclude the multiplication by $\gamma$ (i.e., $\gamma = 1$), presumably because the environment benchmark doesn't know the true discounting the RL user chooses. This environment can be solved using DDPG, for example, without significant hyperparameter tuning. However, if you use $\gamma = 0.99$ for your RL formulation, and edit the LunarLander code such that the PBRS term includes $\gamma = 0.99$, then DDPG fails to solve the environment. So, this is not a small computational issue---it has dramatic effects on training.

My solution has been to simply set $\gamma = 1$ in the PBRS term, even when using, say, $\gamma = 0.99$ in the RL formulation. This solves (or rather, circumvents) every issue above. While this loses the theoretical guarantee that adding the PBRS term does not affect the optimal policy, it can substantially help training. (And there are no optimality guarantees when using neural networks as function approximators anyway.)
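
As a concrete sketch of this workaround (my own illustrative code, not from any particular library; it assumes the classic Gym API where `step` returns four values, and takes a user-supplied potential function over observations):

```python
import gym

class PotentialShapingWrapper(gym.Wrapper):
    """Adds a potential-based shaping term to the environment's reward.

    `shaping_gamma` is deliberately kept separate from the discount factor
    used by the RL algorithm; the default of 1.0 is the practical variant
    discussed above (which gives up the policy-invariance guarantee).
    """

    def __init__(self, env, potential_fn, shaping_gamma=1.0):
        super().__init__(env)
        self.potential_fn = potential_fn      # user-supplied P(observation)
        self.shaping_gamma = shaping_gamma
        self._last_potential = None

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self._last_potential = self.potential_fn(obs)
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        potential = self.potential_fn(obs)
        shaping = self.shaping_gamma * potential - self._last_potential
        self._last_potential = potential
        return obs, reward + shaping, done, info
```

Passing `shaping_gamma=0.99` instead recovers the theoretically correct PBRS term, i.e. the variant that caused the training problems described above.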

This solution also seems to be what most benchmark environments have adopted. For example, most MuJoCo environments use PBRS terms with no $\gamma$ (equivalent to $\gamma = 1$). Alternatively, the omission of $\gamma$ could be attributed to the fact that including it would require the environment to know a priori what value of $\gamma$ the RL user chose. While feeding this into an OpenAI gym environment is easy to do, it's not typically done.

Keep in mind that while the theory guarantees that the optimal policy won't change by adding the PBRS term, adding the term doesn't necessarily help you approach the optimal policy. Yet, the whole point of using PBRS at all is to help you approach a good policy. So, it's a bit of a paradox, and I was comfortable with sacrificing the theoretical guarantee of policy invariance if it meant I could actually get to a good policy in the first place.

– Brenden Petersen

  • Thank you! You confirmed my intuitions about this topic. As a reference, besides the cited paper, one might be interested in the PhD thesis of one of its authors (Chapter 5): http://etheses.whiterose.ac.uk/936/ – Marco Favorito Jun 14 '18 at 10:53