
In PPO, a common pattern I see for calculating the advantages is:

$$\delta[t] = \text{reward}[t] + \gamma \cdot \text{valueNewState}[t] \cdot \text{done}[t] - \text{valueOldState}[t]$$

Such as in this article. I am wondering why we multiply by done[t], which means that for the final step of an episode we take only the reward and not the value of the next state.

Jacob B

1 Answer


done[t] indicates whether the state at timestep t was terminal or not.

It is used to decide whether or not to bootstrap (i.e. estimate) the value $V(s)$ of the next state: we bootstrap only when the episode has not terminated, so that $\gamma \cdot V(s_{t+1})$ provides an estimate of the future discounted reward the agent should receive in expectation. (This is also useful when the episode is artificially truncated rather than truly terminated.)

If the next state is terminal, its value should be zero, because a terminal state "absorbs" all future rewards. In the formula above, done[t] acts as a mask that is 1 for non-terminal steps and 0 for terminal ones (many implementations instead store a done flag that is True at termination and write the mask as 1 - done[t]); multiplying by it zeroes out the bootstrap term exactly at terminal steps.
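A minimal sketch of this masking in NumPy (this is not the article's exact code; the function name and the convention that dones[t] = 1.0 at termination, so the mask is 1 - dones[t], are assumptions):

```python
import numpy as np

def compute_deltas(rewards, values, next_values, dones, gamma=0.99):
    """TD errors (deltas) with the bootstrap term masked at terminal steps.

    dones[t] is 1.0 if the transition at step t ended the episode, else 0.0;
    (1 - dones[t]) plays the role of done[t] in the formula above.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)        # V(s_t)
    next_values = np.asarray(next_values, dtype=np.float64)  # V(s_{t+1})
    masks = 1.0 - np.asarray(dones, dtype=np.float64)    # 0 at terminal steps
    # At terminal steps the gamma * V(s_{t+1}) term is zeroed out,
    # leaving delta = reward - V(s_t).
    return rewards + gamma * next_values * masks - values

deltas = compute_deltas(
    rewards=[1.0, 1.0],
    values=[0.5, 0.5],
    next_values=[0.8, 0.8],  # V(s') is ignored where dones = 1
    dones=[0.0, 1.0],
    gamma=0.9,
)
# deltas[0] = 1 + 0.9 * 0.8 - 0.5 = 1.22
# deltas[1] = 1 - 0.5 = 0.5  (no bootstrap at the terminal step)
```

The second step shows the behavior asked about in the question: because the episode ended there, the target collapses to the reward alone.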

Luca Anzalone