
If I am training an agent to navigate a maze as fast as possible, a simple reward function would be something like

\begin{align} R(\text{terminal}) &= N - \text{time}, \qquad N \gg \text{everything} \\ R(\text{state}) &= 0 \quad \text{if not terminal} \end{align}

i.e. when it reaches the terminal state, it receives a reward, but one that decreases the slower it was. Actually, I'm not sure whether this is better or worse than $R(\text{terminal}) = 1 / \text{time}$, so please correct me if I'm wrong.

However, if the maze is really big, it could spend a long time wandering around before even encountering that reward. Are there any reliable ways of modifying the reward function to make the rewards less sparse? Assume that the agent knows the Euclidean distance between itself and the exit, just not the topography of the maze.

Is it at all sound to simply do something like

\begin{align} R(\text{current}) = \big(d_E(\text{start}, \text{exit}) - d_E(\text{current}, \text{exit})\big) + \mathbb{1}[\text{terminal}]\,(N - \text{time})\,? \end{align}

Or if not, what kind of dense heuristic reward or other techniques might be better?
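For concreteness, here is roughly what I have in mind, as a Python sketch (the names `euclid`, `exit_pos`, `N`, etc. are just placeholders for whatever the environment actually exposes):

```python
import math

def euclid(a, b):
    """Euclidean distance between two (x, y) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def shaped_reward(current, start, exit_pos, time, terminal, N=10_000):
    """The reward I have in mind: a distance-based term on every step,
    plus the original sparse (N - time) bonus on reaching the exit."""
    r = euclid(start, exit_pos) - euclid(current, exit_pos)
    if terminal:
        r += N - time
    return r
```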

Paradox

1 Answer


Doing something like the dense, distance-based reward signal you propose is possible... but you have to do it very carefully: done in a naive manner, it is likely to reinforce unwanted behaviour.

For example, the way I read the reward function you propose, it provides a positive reward for any step taken by the agent, with larger rewards for steps that get you closer to the goal (except for steps moving back into the start, which would have a reward of $0$). There does not appear to be any "compensation" with negative rewards for moves that take you away from the goal; in fact, such steps also still seem to carry positive rewards! This means that the optimal behaviour your agent can end up learning is to keep moving around in circles (somewhat close to the goal, but never quite stepping into the goal) for an infinite amount of time, continuously racking up those positive rewards.
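To make that concrete, here is a minimal sketch (the maze layout, positions, and step count are made up purely for illustration) showing that shuttling back and forth between two cells next to the goal accumulates reward without bound under that definition:

```python
import math

start, exit_pos = (0, 0), (10, 0)

def euclid(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def proposed_reward(current):
    # Non-terminal part of the proposed reward; the (N - time) term
    # never fires because the agent never actually steps onto the exit.
    return euclid(start, exit_pos) - euclid(current, exit_pos)

# Oscillate between two cells adjacent to, but never on, the exit.
total = 0.0
for t in range(1_000):
    pos = (9, 0) if t % 2 == 0 else (8, 0)
    total += proposed_reward(pos)

print(total)  # ~8500 after 1000 steps, and it keeps growing with more steps
```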

The idea of adding some extra (heuristic) rewards to speed up learning is referred to as "reward shaping". Naive approaches to reward shaping often end up unintentionally modifying the "true" objective, as highlighted above. The correct way to implement reward shaping, which provably does not modify the optimal policy, is Potential-Based Reward Shaping. The basic intuition behind this is that, if you use reward shaping to encourage "movement" in one "direction", you should also provide equivalent (taking into account discount factor $\gamma$) discouragement for subsequent "movement" in the other "direction".
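For your maze, a natural potential function would be something like $\Phi(s) = -d_E(s, \text{exit})$, giving the shaping term $F(s, s') = \gamma\,\Phi(s') - \Phi(s)$ on top of the environment's own reward. A minimal sketch of that idea (the function and variable names here are mine, not from any particular library):

```python
import math

GAMMA = 0.99  # the discount factor of your MDP

def euclid(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def potential(state, exit_pos):
    # Phi(s): higher (less negative) the closer the agent is to the exit.
    return -euclid(state, exit_pos)

def shaped_reward(env_reward, state, next_state, exit_pos):
    # r'(s, a, s') = r(s, a, s') + gamma * Phi(s') - Phi(s)
    # (potential-based shaping; provably leaves the optimal policy unchanged)
    shaping = GAMMA * potential(next_state, exit_pos) - potential(state, exit_pos)
    return env_reward + shaping
```

With this formulation, moving one cell closer to the exit earns a small bonus, and moving straight back immediately pays (almost) all of it back, so the "circling for free reward" exploit described above disappears.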

Now, there is this really cool paper named "Expressing Arbitrary Reward Functions as Potential-Based Advice", which proposes a method that can automatically convert additional reward shaping functions, specified in the more "natural" or "intuitive" manner like you did, into (approximately) potential-based ones that are more likely to actually function correctly. This is not quite straightforward though: the approach involves learning an extra value function whose predictions are used to implement the "conversion". So, in practice, in a simple grid world like yours, I think it is simpler to just figure out the correct potential-based definition yourself than to try to learn it like this, but it's cool stuff nevertheless.

Dennis Soemers