The following quote is taken from the beginning of Part II, "Approximate Solution Methods" (p. 198), in "Reinforcement Learning: An Introduction" by Sutton & Barto (2018):
> reinforcement learning generally requires function approximation methods able to handle nonstationary target functions (target functions that change over time). In control methods based on GPI (generalized policy iteration) we often seek to learn $q_\pi$ while $\pi$ changes. Even if the policy [$\pi$] remains the same, the target values of training examples are nonstationary if they are generated by bootstrapping methods (DP and TD learning).
Could someone explain why this is not the case when we use non-bootstrapping methods (such as Monte Carlo, which is not allowed infinite rollouts)?
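To make sure I understand what "target" means here, below is a small toy sketch (my own illustration, not from the book) of the two kinds of targets under linear function approximation $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$. The TD(0) target depends on the current weights $\mathbf{w}$, so it changes whenever the weights are updated, whereas the Monte Carlo target is just the observed return:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 4
w = rng.normal(size=n_features)          # current weight vector

def v_hat(x, w):
    """Approximate state value for feature vector x (linear in w)."""
    return w @ x

# --- Bootstrapping (TD(0)) target ---
# R + gamma * v_hat(x_next, w) depends on the current weights w,
# so every weight update changes the targets of later training examples.
def td_target(reward, x_next, w, gamma=0.9):
    return reward + gamma * v_hat(x_next, w)

# --- Monte Carlo target ---
# The target is the return G_t computed from the rewards of a finished
# episode; it does not involve w at all, so it stays fixed.
def mc_target(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

x_next = rng.normal(size=n_features)
print("TD target before weight update:", td_target(1.0, x_next, w))
w += 0.1 * rng.normal(size=n_features)   # pretend we did a learning step
print("TD target after weight update: ", td_target(1.0, x_next, w))   # changed
print("MC target (same either way):   ", mc_target([1.0, 0.0, 2.0]))  # fixed
```

Is this the distinction the authors are pointing at, or is there more to why Monte Carlo targets count as stationary when the policy is fixed?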