In PPO, a common pattern I see when calculating advantages is:
$$\delta = \text{reward}[t] + \gamma \cdot \text{valueNewState}[t] \cdot \text{done}[t] - \text{valueOldState}[t]$$
One example is in this article. I am wondering why we multiply by done[t], which seems to mean that for the last step of the episode we use only the reward and not the value of the next state (valueNewState[t]).
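
To make the pattern concrete, here is a minimal sketch of how I understand it. It assumes done[t] is stored as a mask that is 1 for ordinary steps and 0 on the final step; that convention, the function name `td_deltas`, and the toy numbers are my own assumptions, not taken from the article.

```python
import numpy as np

def td_deltas(reward, value_old_state, value_new_state, done, gamma=0.99):
    """Per-step delta as in the formula above.

    Assumption: `done` is a mask that is 1.0 for non-final steps and
    0.0 on the final step, so the bootstrap term vanishes there.
    """
    reward = np.asarray(reward, dtype=float)
    value_old_state = np.asarray(value_old_state, dtype=float)
    value_new_state = np.asarray(value_new_state, dtype=float)
    done = np.asarray(done, dtype=float)
    return reward + gamma * value_new_state * done - value_old_state

# Toy three-step rollout (made-up numbers); the last step has done = 0.
reward = [1.0, 1.0, 1.0]
value_old_state = [0.5, 0.6, 0.7]
value_new_state = [0.6, 0.7, 0.9]
done = [1.0, 1.0, 0.0]

deltas = td_deltas(reward, value_old_state, value_new_state, done, gamma=0.5)
print(deltas)
```

With these numbers the last delta is just `reward - value_old_state` (1.0 - 0.7), because the mask zeroes out the `gamma * value_new_state` term, while the earlier steps still include it.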