
In cases where the reward is delayed, this can negatively impact a model's ability to do proper credit assignment. In the case of a sparse reward, are there ways in which this can be mitigated?

In chess, for example, there are moves that correlate strongly with winning the game (such as capturing the opponent's queen), but agents typically only receive a reward at the end of the game, so as not to introduce bias. The downside is that training in such a sparse-reward environment requires a lot of data and many training episodes to converge to a good policy.

Are there existing ways to improve the agent's performance without introducing too much bias to the policy?


1 Answer


Andrew Y. Ng (yes, that famous guy!) et al. proved, in the seminal paper Policy invariance under reward transformations: Theory and application to reward shaping (ICML, 1999), which was then part of his PhD thesis, that potential-based reward shaping (PBRS) is the way to shape the natural/correct sparse reward function (RF) without changing the optimal policy. In other words, if you apply PBRS to your sparse RF, the optimal policy associated with the shaped, denser RF is equal to the optimal policy associated with the original unshaped and sparse RF. This means that PBRS creates an equivalence class of RFs associated with the same optimal policy, i.e. there are multiple RFs associated with the same optimal policy. So, PBRS is the first technique that you can use to deal with sparse reward functions.

To give you more details, let $R\left(s, a, s^{\prime}\right)$ be your original (possibly sparse) RF for your MDP $M$, then

$$R\left(s, a, s^{\prime}\right)+F\left(s, a, s^{\prime}\right)\tag{1}\label{1}$$

is the shaped, denser RF for a new MDP $M'$.

In PBRS, we then define $F$ (the shaping function) as follows

$$ F\left(s, a, s^{\prime}\right)=\gamma \Phi\left(s^{\prime}\right)-\Phi(s), \tag{2}\label{2} $$ where $\Phi: S \rightarrow \mathbb{R}$ is a real-valued function that indicates the desirability of being in a specific state. If we define $F$ as in \ref{2}, we call it a potential-based shaping function.
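
To make this concrete, here is a minimal sketch in Python of how the shaping in \ref{2} could be wrapped around a sparse reward. The functions `sparse_reward` and `phi` are hypothetical placeholders that you would supply yourself; nothing here comes from the paper.

```python
# Minimal sketch of potential-based reward shaping (PBRS), eq. (2).
# `sparse_reward` and `phi` are hypothetical, user-supplied functions:
#   sparse_reward(s, a, s_next) -> float   the original, possibly sparse RF
#   phi(s) -> float                        desirability (potential) of state s

def shaped_reward(s, a, s_next, sparse_reward, phi, gamma=0.99):
    """Return R(s, a, s') + F(s, a, s'), where F = gamma * phi(s') - phi(s)."""
    return sparse_reward(s, a, s_next) + gamma * phi(s_next) - phi(s)
```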

The intuition behind $F$ in \ref{2} is that it prevents the agent from "going in circles" to collect more and more reward. To be more precise, let's say that there's a state $s^* = s_t$ that is desirable but is not the goal state. If you shaped the reward function naively, by adding a positive reward (e.g. 5) whenever the agent reaches $s^*$, it could just go back and forth between $s^*$ and a neighbouring state in order to collect that reward over and over without ever reaching the goal state (i.e. reward hacking). That's, of course, not desirable. With PBRS, you instead encode this preference in the potential, e.g. $\Phi(s^*) = 5$ and $\Phi(s_{t-1}) = 0$ for the previous state $s_{t-1}$. Then, whenever the agent leaves $s^*$ to go back to $s_{t-1}$ (i.e. the next state is $s_{t+1} = s_{t-1}$, with the intention of re-entering $s^*$ afterwards, $s_{t+2} = s^*$, i.e. going in circles), \ref{2} adds to your original sparse reward function the following

$$ F\left(s^*, a, s_{t+1}\right)=\gamma \Phi\left(s_{t+1}\right)-\Phi\left(s^*\right)=\gamma \cdot 0 - 5 = -5. \tag{3}\label{3} $$

In other words, stepping back to $s_{t+1}$ (in order to later return to $s^*$) is punished, and this penalty cancels (up to discounting) the bonus the agent would collect for re-entering $s^*$: the shaping terms telescope, so nothing is gained by going in circles. So, if you define $F$ as in equation \ref{2}, you avoid the reward hacking that can arise if you shape your RF in an ad-hoc manner (i.e. "as it seems good to you").
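
As a quick sanity check, here is a toy numeric sketch (the two states, potentials and discount factor are made up purely for illustration) showing that the shaping terms of \ref{2} telescope over a trajectory, so "going in circles" collects no net shaped reward:

```python
# Toy check that the PBRS terms of eq. (2) telescope:
#   sum_t gamma^t * F(s_t, a_t, s_{t+1}) = gamma^T * phi(s_T) - phi(s_0),
# so cycling between states earns no extra shaped return.
gamma = 0.9
phi = {"s_star": 5.0, "s_other": 0.0}  # assumed potentials, for illustration only

# A trajectory that "goes in circles": s_other -> s_star -> s_other -> ...
trajectory = ["s_other", "s_star"] * 4 + ["s_other"]

shaping_return = sum(
    gamma ** t * (gamma * phi[s_next] - phi[s])
    for t, (s, s_next) in enumerate(zip(trajectory, trajectory[1:]))
)
T = len(trajectory) - 1
telescoped = gamma ** T * phi[trajectory[-1]] - phi[trajectory[0]]
print(round(shaping_return, 12), round(telescoped, 12))  # both 0.0: no gain from cycling
```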

One of the first papers that reported this "going in circles" behaviour was Learning to Drive a Bicycle using Reinforcement Learning and Shaping (Randløv and Alstrøm, 1998), which notes:

We agree with Mataric [Mataric, 1994] that these heterogeneous reinforcement functions have to be designed with great care. In our first experiments we rewarded the agent for driving towards the goal but did not punish it for driving away from it. Consequently the agent drove in circles with a radius of 20–50 meters around the starting point. Such behavior was actually rewarded by the reinforcement function

You probably also want to watch this video.

Another approach to tackling the sparse-reward problem is to learn a reward function from an expert/optimal policy or from demonstrations (inverse reinforcement learning), or to avoid using a reward function altogether and simply learn the policy directly from demonstrations (imitation learning).
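
To make the imitation-learning option concrete, here is a minimal behaviour-cloning sketch on made-up tabular demonstrations (the states, actions and data are purely illustrative): the policy is read off directly from expert state-action pairs, and no reward function appears anywhere.

```python
# Minimal behaviour-cloning sketch (imitation learning) on a toy tabular problem:
# the learned "policy" just imitates the most frequent expert action in each state.
from collections import Counter, defaultdict

# Hypothetical (state, action) pairs collected from an expert.
demonstrations = [
    ("s0", "right"), ("s1", "up"), ("s2", "up"),
    ("s0", "right"), ("s1", "up"), ("s2", "up"),
    ("s0", "left"),
]

counts = defaultdict(Counter)
for state, action in demonstrations:
    counts[state][action] += 1

# Greedy behaviour-cloned policy: most common expert action per state.
policy = {state: ctr.most_common(1)[0][0] for state, ctr in counts.items()}
print(policy)  # {'s0': 'right', 's1': 'up', 's2': 'up'}
```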

Note that I'm not saying that any of these solutions fully solves all your problems. They have advantages (e.g., if done correctly, PBRS can speed up the learning process) but also disadvantages (e.g., in the case of IRL, if your expert demonstrations are scarce, then your learned RF may not be good either), which I will not discuss further here (also because I don't currently have much practical experience with any of these techniques).

  • This answers my question and shows how a PBRS function can be designed. But I will say that the conditions for designing a good $\Phi$ are hard to satisfy. In the examples given in that paper, knowledge of the goal state is required to define $\Phi$ as the Manhattan distance from the goal. In this case the designer is being asked to come up with the goals/subgoals directly. This gets much more complex as the number of absorbing states increases (think chess again, where there are countless variations of 'checkmate') or when there is an infinite time horizon. Can you point to any papers that attempt this? – tryingtolearn Feb 10 '21 at 15:47
  • @tryingtolearn I think this would be another great question for the site, i.e. how to come up with a good $\Phi$. I still don't know the answer, but I may investigate it. I have only read the paper "Policy invariance under reward transformations: Theory and application to reward shaping" and a few other related ones, but none of the papers that I've read so far go in the direction of what you're asking. So, I suggest that you ask another question; maybe someone already has a good answer for it. – nbro Feb 10 '21 at 15:49