
Does the learning rate parameter $\alpha$ require the Robbins-Monro conditions below for the TD(0) algorithm to converge to the true value function of a policy?

$$\sum \alpha_t =\infty \quad \text{and}\quad \sum \alpha^{2}_t <\infty$$

asked by KaneM
  • This question was also asked at [https://stats.stackexchange.com/q/451115/82135](https://stats.stackexchange.com/q/451115/82135). – nbro Apr 15 '20 at 20:02

1 Answer


The paper *Convergence of Q-learning: A Simple Proof* (by Francisco S. Melo) shows (theorem 1) that Q-learning, a TD(0) algorithm, converges with probability 1 to the optimal Q-function, provided that the Robbins-Monro conditions hold for every state-action pair. In other words, the Robbins-Monro conditions are sufficient for Q-learning to converge to the optimal Q-function in the case of a finite MDP. The proof of theorem 1 uses another theorem from stochastic approximation (theorem 2).
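As a quick sanity check (not part of Melo's proof), the canonical schedule $\alpha_t = 1/t$ satisfies both Robbins-Monro conditions: $\sum 1/t$ is the harmonic series, which diverges, while $\sum 1/t^2$ converges to $\pi^2/6$. A minimal numerical illustration:

```python
import math

# Partial sums of the two series for the schedule alpha_t = 1/t.
N = 1_000_000
s1 = sum(1.0 / t for t in range(1, N + 1))     # ~ ln(N) + 0.5772..., grows without bound
s2 = sum(1.0 / t**2 for t in range(1, N + 1))  # converges to pi^2 / 6 ~ 1.6449

print(s1)  # roughly 14.39 for N = 1e6; keeps growing as N increases
print(s2)  # very close to pi^2 / 6
```

A constant $\alpha$, by contrast, violates the second condition, since $\sum \alpha^2 = \infty$.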

You are interested in the prediction problem, i.e. estimating the expected return (a value function) of a fixed policy. Q-learning, by contrast, is a control algorithm: it learns the optimal Q-function, from which the optimal policy can be derived.
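To make the prediction setting concrete, here is a minimal sketch (not from the answer) of tabular TD(0) on a hypothetical deterministic two-state chain, using the per-state schedule $\alpha = 1/n(s)$, which satisfies the Robbins-Monro conditions. With $\gamma = 1$ and a reward of 1 on the final transition, the true values are $V(s_0) = V(s_1) = 1$, and the estimates approach them:

```python
gamma = 1.0
V = {0: 0.0, 1: 0.0}   # tabular value estimates
visits = {0: 0, 1: 0}  # per-state visit counts for the step-size schedule

def step(s):
    """Hypothetical environment: chain s0 -> s1 -> terminal, reward 1 at the end."""
    if s == 0:
        return 1, 0.0, False  # next state, reward, done
    return None, 1.0, True

for episode in range(200):
    s, done = 0, False
    while not done:
        s_next, r, done = step(s)
        visits[s] += 1
        alpha = 1.0 / visits[s]          # satisfies the Robbins-Monro conditions
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])  # TD(0) update
        s = s_next

print(V)  # both estimates approach 1.0
```

This is the prediction analogue of the Q-learning result above: the same step-size conditions, applied per state rather than per state-action pair.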

See also the question Why doesn't Q-learning converge when using function approximation?.

answered by nbro
  • Thanks for that. I was wondering: is it true that TD(0) converges to the value function in expected value when $\alpha$ is constant? Is there a condition that $\alpha \in (0,1)$? – KaneM Feb 24 '20 at 23:23
  • @KaneM Right now, I am not aware of this result. Please ask another question on the site, because maybe someone else can answer it. – nbro Feb 24 '20 at 23:28
  • The result is in Sutton's [Learning to Predict by the Methods of Temporal Differences](http://incompleteideas.net/papers/dayan-92.pdf), on page 24 of this version. I think I understand the result now. It is saying that there is some range of alphas $\alpha \in (0,t)$ such that TD(0) converges to the value function in expected value. – KaneM Feb 25 '20 at 00:05