
The following is from the book "Reinforcement learning and optimal control", by D. P. Bertsekas.

Chapter 2, page 52:

"The motivation for $l$-step lookahead is that for increasing values of $l$, one may require a less accurate approximation $\tilde{J}_{k+l}$ to obtain good performance. Otherwise expressed, for the same quality of cost function approximation, better performance maybe obtained as $l$ becomes larger. This makes intuitive sense, since in this case, the cost of more stages is treated exactly with optimization."

My question:

Why is it that for increasing values of $l$, one may require a less accurate approximation $\tilde{J}_{k+l}$ to obtain good performance?

DSPinfinity

1 Answer


Let's start with the intuitive explanation. If you use Q-learning (or any other temporal-difference method such as SARSA), you usually look ahead only one step. The other extreme is a Monte Carlo method, where your updates do not rely on estimates of the state-value function at all (your policy might depend on an estimate, but your updates don't). Monte Carlo methods can be viewed as an $\infty$-step lookahead. An $l$-step lookahead sits somewhere in between: it accumulates $l$ rewards in a Monte Carlo fashion and only bootstraps from the estimate after that, so it reduces the dependency on the estimate by quite some margin. Therefore, with an $l$-step lookahead you can get away with a less accurate (higher-variance) estimate of the state-value function than you could with a $1$-step lookahead. A minimal sketch of this trade-off follows below.
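To make this concrete, here is a small sketch in Python (the helper name `l_step_target` and its arguments are hypothetical, just for illustration) showing how the lookahead depth controls how much of the update target comes from observed rewards versus from the value estimate:

```python
def l_step_target(rewards, value_estimates, l, gamma=0.99):
    """Bootstrapped l-step lookahead target starting at time t = 0.

    rewards[k]         : reward observed at step k (k = 0, ..., T-1)
    value_estimates[k] : estimate J~ of the value of the state reached after k steps
    l                  : lookahead depth; l = 1 is a TD(0)-style target,
                         l >= len(rewards) degenerates to a Monte Carlo return
    """
    horizon = min(l, len(rewards))
    # The first l rewards are used exactly (the part "treated exactly" in the quote) ...
    target = sum(gamma**k * rewards[k] for k in range(horizon))
    # ... and only then do we bootstrap from the (possibly inaccurate) estimate,
    # whose error enters the target scaled down by gamma**l.
    if horizon < len(rewards):
        target += gamma**horizon * value_estimates[horizon]
    return target
```

With $l = 1$ the target leans almost entirely on the estimate; as $l$ grows, the estimate's contribution is discounted by $\gamma^{l}$ and the sampled rewards dominate.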

Now let's look at a more quantitative explanation. You are looking for the error reduction property of $l$-step lookaheads (see the derivation in Sutton and Barto, in their notation), which can be stated as \begin{equation} \max_x \Bigl\lvert \bigl(T_{\pi}^{\,l}\tilde{J}_{t}\bigr)(x) - J_{\pi}(x)\Bigr\rvert \le \gamma^{l} \max_x \Bigl\lvert \tilde{J}_{t}(x) - J_{\pi}(x)\Bigr\rvert, \end{equation} where $T_{\pi}$ is the Bellman operator for the policy $\pi$, so that $(T_{\pi}^{\,l}\tilde{J}_{t})(x) = \mathbb{E}_{\pi}\bigl[G_{t:t+l} \mid X_t = x\bigr]$ is the expected $l$-step return that bootstraps from $\tilde{J}_{t}$ after $l$ steps.

This shows that the error of the $l$-step lookahead backup is at most $\gamma^{l}$ times the error of the underlying estimate, so it shrinks faster than with a simple $1$-step backup. Hence one can tolerate a worse estimate $\tilde{J}$ and still achieve the same quality of results with multi-step lookahead.
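The inequality above can also be checked numerically. The following is a minimal sketch, assuming a randomly generated 5-state Markov chain under a fixed policy (the transition matrix `P`, reward vector `r`, and discount $\gamma = 0.9$ are made up for illustration); it applies the Bellman operator $l$ times to a deliberately poor approximation $\tilde{J}$ and compares the resulting error with the bound $\gamma^{l}$ times the initial error:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9
# Hypothetical fixed-policy transition matrix and expected one-step rewards.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_states)

def T(J):
    """One application of the Bellman operator: (T J)(x) = r(x) + gamma * E[J(X')]."""
    return r + gamma * P @ J

# True value function J_pi solves J = T J, i.e. J_pi = (I - gamma P)^{-1} r.
J_pi = np.linalg.solve(np.eye(n_states) - gamma * P, r)

# A deliberately poor approximation J~ of J_pi.
J_tilde = J_pi + rng.normal(scale=1.0, size=n_states)

for l in [1, 2, 5, 10]:
    J_l = J_tilde
    for _ in range(l):
        J_l = T(J_l)                                   # l-step lookahead backup of J~
    lhs = np.max(np.abs(J_l - J_pi))                   # error after the l-step backup
    rhs = gamma**l * np.max(np.abs(J_tilde - J_pi))    # bound from the inequality above
    print(f"l={l:2d}  error={lhs:.4f}  <=  gamma^l * initial error={rhs:.4f}")
```

The printed errors decrease roughly by a factor of $\gamma$ per extra lookahead step, which is exactly the mechanism behind "less accurate $\tilde{J}$, same performance".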

pythonic833