
In the original paper, the objective of PPO is the clipped surrogate
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$
where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$. My question is: how does this objective behave in a sparse reward setting (i.e., a reward is only given after a sequence of actions has been taken)? In that case we don't have $\hat{A}_{t}$ defined for every $t$.
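For concreteness, a minimal NumPy sketch of that clipped surrogate (the function name, array shapes, and default $\epsilon$ are my own illustration, not taken from the paper); the point is that it consumes an $\hat{A}_{t}$ for every timestep in the batch:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate L^CLIP averaged over a batch of timesteps.

    ratio:     r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), shape (T,)
    advantage: advantage estimates A_hat_t, shape (T,)
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))
```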

Sam
  • Why won't we have $\hat{A}_t$ for every $t$? There is usually a tuple of (state, reward, next_state) associated with every time step $t$. For sparse reward settings, the reward will be 0 for non-reward states. – desert_ranger Mar 06 '23 at 01:40
  • @desert_ranger Yes, you can think of the reward as 0 for those states, but in some situations it might be undefined. – Sam Mar 06 '23 at 10:58
  • It is the user who designs the reward for each step. Therefore, as long as the environment is formulated correctly, this shouldn't happen. – desert_ranger Mar 06 '23 at 23:28
  • @desert_ranger think of Go. By default, not every move has a reward assigned. Are you suggesting going down the reward-shaping route to introduce artificial rewards? – Sam Mar 07 '23 at 02:45
  • A reward _must_ exist for every $t$. As we are operating in an MDP, each $t$ corresponds to a transition from a state $s$ to another state $s'$ given an action $a$. By the definition of the MDP, a reward _must_ be associated with this transition, otherwise you are not working in a proper MDP. This is true for the game of Go. I believe the work by DeepMind just assigned a score of +1 for winning, -1 for losing, and every intermediate step was assigned a reward of 0. – David Apr 05 '23 at 10:23
  • 1
    @DavidIreland yeah so to fit into MDP framework, can assign dummy reward 0 to the intermediate states. – Sam Apr 06 '23 at 14:06
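A minimal sketch of that resolution (assuming GAE-style advantage estimation as commonly used with PPO; the helper name and the toy episode are illustrative, not from the paper): once every intermediate step carries reward 0, $\hat{A}_t$ is defined for every $t$, and credit from the terminal reward propagates backwards through the estimates.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one complete episode.

    rewards: r_t for t = 0..T-1 (mostly 0 in a sparse setting, e.g. only
             the final step carries +1/-1)
    values:  V(s_t) for t = 0..T (one extra entry for the terminal state)
    Returns A_hat_t for every t, even when intermediate rewards are 0.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD error at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Sparse-reward toy episode: only the last step is rewarded.
rewards = np.zeros(10)
rewards[-1] = 1.0
values = np.zeros(11)  # e.g. an untrained critic, V(terminal) = 0
print(gae_advantages(rewards, values))  # nonzero A_hat_t at every t
```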

0 Answers