
Is it correct that, for SARSA to converge to the optimal value function (and policy), the following conditions must hold?

  1. The learning rate parameter $\alpha$ must satisfy the (Robbins-Monro) conditions: $$\sum_k \alpha_{n_k(s,a)} =\infty \quad \text{and}\quad \sum_k \alpha_{n_k(s,a)}^{2} <\infty \quad \forall (s,a) \in \mathcal{S} \times \mathcal{A},$$ where $n_k(s,a)$ denotes the $k^\text{th}$ time $(s,a)$ is visited.

  2. $\epsilon$ (of the $\epsilon$-greedy policy) must be decayed so that the policy converges to a greedy policy.

  3. Every state-action pair is visited infinitely many times.

Are any of these conditions redundant?
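
For concreteness, the kind of schedules I have in mind for conditions 1 and 2 look like the following (a minimal Python sketch; the specific decay rates are just illustrative assumptions, not the only valid choices):

```python
# Illustrative schedules only; many other choices satisfy the same conditions.

def alpha(n: int) -> float:
    """Step size for the n-th visit of (s, a): sum(1/n) diverges, sum(1/n^2) converges."""
    return 1.0 / n

def epsilon(k: int) -> float:
    """Exploration rate in episode k: decays to 0, so the policy becomes greedy in the limit."""
    return 1.0 / k
```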

  • There is probably some relation to the fact that the TD(0) estimate converges to the true value function with probability 1 when the learning rate parameter is decayed as above. There also exists some interval $\alpha \in (0,p)$ such that the TD estimate converges in expected value to the true value function. – KaneM Feb 27 '20 at 13:17

2 Answers


The paper *Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms* by Satinder Singh et al. proves that SARSA(0), with a tabular representation of the value functions, converges to the optimal value function, provided that certain assumptions are met:

  1. Infinite visits to every state-action pair
  2. The learning policy becomes greedy in the limit

The properties are stated more formally in Lemma 1 (page 7 of the PDF) and Theorem 1 (page 8). The Robbins–Monro conditions should ensure that each state-action pair is visited infinitely often.
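
As a quick numerical illustration (my own sketch, not from the paper), the partial sums below show the difference between a Robbins–Monro step size such as $\alpha_n = 1/n$ and a constant step size, whose squared sum also diverges:

```python
# Compare partial sums of alpha_n and alpha_n^2 for two step-size schedules.
N = 100_000

harmonic = [1.0 / n for n in range(1, N + 1)]   # alpha_n = 1/n
constant = [0.1] * N                            # alpha_n = 0.1 (constant)

# 1/n: first sum grows without bound (~ log N), squared sum converges to pi^2/6.
print(sum(harmonic), sum(a * a for a in harmonic))
# constant: both sums grow linearly in N, so the second condition fails.
print(sum(constant), sum(a * a for a in constant))
```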

  • I was already writing this answer before [the other answer](https://ai.stackexchange.com/a/18279/2444) was published, but these answers are equivalent. I am only citing the paper that originally proved this. – nbro Feb 27 '20 at 13:40
  • Thanks for the answer. Sorry about my poor formatting of posts here. Just learning about markdown. – KaneM Feb 27 '20 at 13:45
  • @KaneM No problem! Have a look at the source code of my answer and yours (now, after my edits) to better understand how markdown works :) Anyway, I am not completely sure that the Robbins-Monro conditions ensure that each state-action pair is visited infinitely often. Maybe it's the other way around: if each state-action pair is visited infinitely often, then the Robbins-Monro conditions are satisfied. – nbro Feb 27 '20 at 13:48
  • I believe they are separate conditions. From my understanding, the RM conditions ensure that the Q function eventually converges to **a** value, while the condition that every state-action pair is visited infinitely often ensures that it converges to the correct value. – KaneM Feb 27 '20 at 13:59
  • @KaneM These conditions are closely related. I'm linking to a paper that seems to relate the two in a more precise way, but I haven't yet fully read it. – nbro Feb 27 '20 at 14:03
  • I wonder whether, if the Robbins-Monro conditions do not hold, there exists some range of $\alpha$ such that the policy converges, in some notion of expectation, to the optimal policy. – KaneM Feb 27 '20 at 14:32
  • @KaneM This is an interesting question. Maybe ask on the site! I wouldn't currently be able to answer it, but maybe there's already some research work that goes in this direction. In all convergence results I've seen so far, the RM conditions are assumed (essentially, because these RL theorems and proofs follow from the original results by Robbins and Monro, AFAIK). – nbro Feb 27 '20 at 14:43

The conditions for convergence are listed in the notes *SARSA convergence* by Nahum Shimkin (a small illustrative sketch follows the list).

  1. The Robbins-Monro conditions above hold for $\alpha_t$.

  2. Every state-action pair is visited infinitely often.

  3. The policy becomes greedy with respect to $Q$ in the limit.

  4. The controlled Markov chain is communicating: every state can be reached from any other state with positive probability (under some policy).

  5. $\operatorname{Var}[R(s, a)] < \infty$, where $R$ is the reward function.
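
To make the role of conditions 1–3 concrete, here is a minimal tabular SARSA sketch that uses per-pair visit counts for the step size and a decaying $\epsilon$. The environment interface (`env.reset()`, `env.step(a)`, `env.actions`) and the specific $1/k$ and $1/n$ decays are my own assumptions, not part of Shimkin's notes:

```python
import random
from collections import defaultdict

def sarsa(env, episodes=1000, gamma=0.99):
    """Tabular SARSA sketch assuming env.reset() -> s, env.step(a) -> (s', r, done),
    and a finite action list env.actions (hypothetical interface)."""
    Q = defaultdict(float)       # Q[(s, a)] estimates
    visits = defaultdict(int)    # n(s, a): per-pair visit counts

    def eps_greedy(s, eps):
        if random.random() < eps:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for k in range(1, episodes + 1):
        eps = 1.0 / k            # epsilon -> 0: greedy in the limit (condition 3)
        s = env.reset()
        a = eps_greedy(s, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = eps_greedy(s2, eps)
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]   # Robbins-Monro step size (condition 1)
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```

Condition 2 is then a property of the environment and exploration together (the communicating assumption in condition 4 helps make it possible), not something the update rule itself enforces.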
