
On page 149 of Sutton & Barto's book (2nd edition), there is equation 7.11:

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \, \rho_{t+1:t+n} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$$

I am having a hard time understanding this equation.

I would have thought that we should move $Q$ towards $G$, where $G$ is corrected by importance sampling, but only $G$, not $G - Q$. I would therefore have expected the update to be of the form

$Q \leftarrow Q + \alpha (\rho G - Q)$

and not

$Q \leftarrow Q + \alpha \rho (G - Q)$

I don't get why the entire update is weighted by $\rho$ and not only the sampled return $G$.

nbro
  • Thank you @nbro for the edit, I was a bit lazy with the equations :) – Antoine Savine Apr 05 '19 at 14:30
  • Hi Antoine! Please, next time try to put a bit more effort into writing these equations! Whenever I can, I try to write them nicely, but I would prefer if every user does it for their own questions/answers, of course! – nbro Apr 05 '19 at 14:31

1 Answer


Multiplying the entire update by $\rho$ has the desirable property that experience affects $Q$ less when the behavior policy is unrelated to the target policy. In the extreme, if the trajectory taken has zero probability under the target policy, then $Q$ isn't updated at all, which is what we want. Alternatively, if only $G$ were scaled by $\rho$, zero-probability trajectories would still update $Q$, artificially driving it towards zero.
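
As a quick numerical illustration (a minimal sketch with arbitrary values for $\alpha$, $G$, and $Q$, not taken from the book), compare the two candidate updates as $\rho$ shrinks to zero:

```python
# Minimal sketch (alpha, G, and Q are arbitrary, hypothetical values) comparing
# the two candidate updates as the importance-sampling ratio rho shrinks to zero.

alpha = 0.1   # step size
G = 5.0       # hypothetical sampled return under the behavior policy
Q = 2.0       # hypothetical current action-value estimate

for rho in (1.0, 0.5, 0.0):
    q_book = Q + alpha * rho * (G - Q)    # Eq. 7.11: whole error weighted by rho
    q_alt  = Q + alpha * (rho * G - Q)    # alternative: only G weighted by rho
    print(f"rho = {rho:.1f}   Eq. 7.11: {q_book:.3f}   alternative: {q_alt:.3f}")

# For rho = 0, the Eq. 7.11 update leaves Q at 2.000 (the trajectory is simply
# ignored), whereas the alternative update moves Q towards 0 (to 1.800) even
# though the observed return says nothing about the target policy.
```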

Philip Raeisghasem