
In PPO, a common pattern I see for calculating the advantages is:

$$\delta[t] = \text{reward}[t] + \gamma \cdot \text{valueNewState}[t] \cdot \text{done}[t] - \text{valueOldState}[t]$$

Such as in this article. I am wondering why we multiply by done[t], which means that for the final step of an episode we take only the reward and not the value of the next state.

Jacob B

1 Answer


done[t] indicates whether the state at timestep t was terminal or not.

It is used to decide whether or not to bootstrap (i.e. estimate) the value $V(s)$ of the next state: we bootstrap only when the episode has not terminated, so that $\gamma \cdot V(s_{t+1})$ provides an estimate of the future discounted reward the agent should receive in expectation. (This is also useful when the episode is artificially truncated rather than truly terminated.)

If the next state is terminal, its value should be zero, because a terminal state "absorbs" all future rewards. In the formula above, done[t] acts as a mask that is 1 for non-terminal steps and 0 for terminal ones (many implementations instead store a done flag that is True at termination and write the mask as 1 - done[t]); multiplying by it zeroes out the bootstrap term exactly at terminal steps.
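A minimal sketch of this masking in NumPy (this is not the article's exact code; the function name and the convention that dones[t] = 1.0 at termination, so the mask is 1 - dones[t], are assumptions):

```python
import numpy as np

def compute_deltas(rewards, values, next_values, dones, gamma=0.99):
    """TD errors (deltas) with the bootstrap term masked at terminal steps.

    dones[t] is 1.0 if the transition at step t ended the episode, else 0.0;
    (1 - dones[t]) plays the role of done[t] in the formula above.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)        # V(s_t)
    next_values = np.asarray(next_values, dtype=np.float64)  # V(s_{t+1})
    masks = 1.0 - np.asarray(dones, dtype=np.float64)    # 0 at terminal steps
    # At terminal steps the gamma * V(s_{t+1}) term is zeroed out,
    # leaving delta = reward - V(s_t).
    return rewards + gamma * next_values * masks - values

deltas = compute_deltas(
    rewards=[1.0, 1.0],
    values=[0.5, 0.5],
    next_values=[0.8, 0.8],  # V(s') is ignored where dones = 1
    dones=[0.0, 1.0],
    gamma=0.9,
)
# deltas[0] = 1 + 0.9 * 0.8 - 0.5 = 1.22
# deltas[1] = 1 - 0.5 = 0.5  (no bootstrap at the terminal step)
```

The second step shows the behavior asked about in the question: because the episode ended there, the target collapses to the reward alone.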

Luca Anzalone