
Does the learning rate parameter $\alpha$ require the Robbins-Monro conditions below for the TD(0) algorithm to converge to the true value function of a policy?

$$\sum \alpha_t =\infty \quad \text{and}\quad \sum \alpha^{2}_t <\infty$$

asked by KaneM
  • This question was also asked at [https://stats.stackexchange.com/q/451115/82135](https://stats.stackexchange.com/q/451115/82135). – nbro Apr 15 '20 at 20:02

1 Answer


The paper *Convergence of Q-learning: A Simple Proof* (by Francisco S. Melo) shows (theorem 1) that Q-learning, a TD(0) algorithm, converges with probability 1 to the optimal Q-function, provided that the Robbins-Monro conditions hold for every state-action pair. In other words, the Robbins-Monro conditions are sufficient for Q-learning to converge to the optimal Q-function in the case of a finite MDP. The proof of theorem 1 uses another theorem from stochastic approximation (theorem 2).
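As a quick sanity check (not part of Melo's proof), the canonical schedule $\alpha_t = 1/t$ satisfies both Robbins-Monro conditions: $\sum 1/t$ is the harmonic series, which diverges, while $\sum 1/t^2$ converges to $\pi^2/6$. A minimal numerical illustration:

```python
import math

# Partial sums of the two series for the schedule alpha_t = 1/t.
N = 1_000_000
s1 = sum(1.0 / t for t in range(1, N + 1))     # ~ ln(N) + 0.5772..., grows without bound
s2 = sum(1.0 / t**2 for t in range(1, N + 1))  # converges to pi^2 / 6 ~ 1.6449

print(s1)  # roughly 14.39 for N = 1e6; keeps growing as N increases
print(s2)  # very close to pi^2 / 6
```

A constant $\alpha$, by contrast, violates the second condition, since $\sum \alpha^2 = \infty$.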

You are interested in the prediction problem, i.e. estimating the expected return (a value function) of a fixed policy. Q-learning, by contrast, is a control algorithm: it learns the optimal Q-function, from which the optimal policy can be derived.
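To make the prediction setting concrete, here is a minimal sketch (not from the answer) of tabular TD(0) on a hypothetical deterministic two-state chain, using the per-state schedule $\alpha = 1/n(s)$, which satisfies the Robbins-Monro conditions. With $\gamma = 1$ and a reward of 1 on the final transition, the true values are $V(s_0) = V(s_1) = 1$, and the estimates approach them:

```python
gamma = 1.0
V = {0: 0.0, 1: 0.0}   # tabular value estimates
visits = {0: 0, 1: 0}  # per-state visit counts for the step-size schedule

def step(s):
    """Hypothetical environment: chain s0 -> s1 -> terminal, reward 1 at the end."""
    if s == 0:
        return 1, 0.0, False  # next state, reward, done
    return None, 1.0, True

for episode in range(200):
    s, done = 0, False
    while not done:
        s_next, r, done = step(s)
        visits[s] += 1
        alpha = 1.0 / visits[s]          # satisfies the Robbins-Monro conditions
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])  # TD(0) update
        s = s_next

print(V)  # both estimates approach 1.0
```

This is the prediction analogue of the Q-learning result above: the same step-size conditions, applied per state rather than per state-action pair.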

See also the question Why doesn't Q-learning converge when using function approximation?.

answered by nbro
  • Thanks for that. I was wondering: is it true that TD(0) converges to the value function in expected value when $\alpha$ is constant? Is there a condition that $\alpha \in (0,1)$? – KaneM Feb 24 '20 at 23:23
  • @KaneM Right now, I am not aware of this result. Please ask another question on the site, because maybe someone else can answer it. – nbro Feb 24 '20 at 23:28
  • The result is in Sutton's [Learning to Predict by the Methods of Temporal Differences](http://incompleteideas.net/papers/dayan-92.pdf), on page 24 of this version. I think I understand the result now. It is saying that there is some range of alphas $\alpha \in (0,t)$ such that TD(0) converges to the value function in expected value. – KaneM Feb 25 '20 at 00:05