
I was following a reinforcement learning course on Coursera, and in this video at 2:57 the instructor says:

> Expected SARSA and SARSA both allow us to learn an optimal $\epsilon$-soft policy, but Q-learning does not.

From what I understand, SARSA and Q-learning both give us an estimate of the optimal action-value function. SARSA does this on-policy, with an $\epsilon$-greedy behaviour policy for example, whereas Q-learning is off-policy: the action-values it learns are those of a deterministic target policy, which is always greedy.

But, can't we use these action values generated by the Q-learning algorithm to form an $\epsilon$-greedy policy?

We can, for instance, in each state give probability $1 - \epsilon + \frac{\epsilon}{|\mathcal{A}|}$ to the action with the greatest action-value and probability $\frac{\epsilon}{|\mathcal{A}|}$ to each of the remaining actions, where $|\mathcal{A}|$ is the number of actions. After all, we do something similar with SARSA, where we derive the policy from the current estimate of the action-values after each update.
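To make this concrete, here is a small sketch of how such a policy could be built from a Q-table (the action-values and the $\epsilon$ below are made-up numbers, purely for illustration):

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Turn one state's action-values into epsilon-greedy action probabilities."""
    n_actions = len(q_values)
    # Every action gets the exploratory share epsilon / |A| ...
    probs = np.full(n_actions, epsilon / n_actions)
    # ... and the greedy action additionally receives the remaining 1 - epsilon.
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

# Made-up action-values for a single state.
print(epsilon_greedy_probs(np.array([1.0, 2.5, 0.3]), epsilon=0.1))
# -> [0.0333... 0.9333... 0.0333...]
```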


1 Answer


If we assume a tabular setting, then Q-learning converges to the optimal state-action value function, from which an optimal policy can be derived, provided a few conditions are met (essentially, that every state-action pair keeps being updated and that the step sizes decay appropriately).
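For concreteness, here is a minimal sketch of tabular Q-learning; the environment interface, the constant step size and the fixed $\epsilon$ are illustrative assumptions on my part, not part of the convergence result (which requires appropriately decaying step sizes):

```python
import numpy as np

def q_learning(env, n_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Assumes a small discrete environment exposing n_states, n_actions,
    # reset() -> state, and step(action) -> (next_state, reward, done).
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy with respect to the current Q.
            a = np.random.randint(env.n_actions) if np.random.rand() < epsilon \
                else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy target: bootstrap from the greedy (max) action,
            # regardless of which action the behaviour policy takes next.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not done) - Q[s, a])
            s = s_next
    return Q
```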

In finite MDPs, there is at least one optimal (stationary) deterministic policy, but there can also be optimal stochastic policies: specifically, if two or more actions in a state have the same optimal value, then you can stochastically choose between them. However, if you stochastically choose between all actions (including non-optimal ones), then you will not behave optimally.
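As an illustration, one way to derive such an optimal stochastic policy from the (converged) action-values is to spread probability uniformly over the actions tied for the maximum; the numbers below are made up:

```python
import numpy as np

def stochastic_greedy_probs(q_values, tol=1e-8):
    """Spread probability uniformly over all actions tied for the maximal value.
    Such a policy is still optimal; putting mass on non-maximal actions is not."""
    q_values = np.asarray(q_values, dtype=float)
    ties = np.isclose(q_values, q_values.max(), atol=tol)
    return ties / ties.sum()

print(stochastic_greedy_probs([1.0, 3.0, 3.0]))  # -> [0.  0.5 0.5]
```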

SARSA also converges to the optimal state-action value function, but the behaviour (learning) policy must eventually become greedy, for example by decaying $\epsilon$ to zero. See this post and Theorem 1 (p. 294) of [this paper](https://link.springer.com/content/pdf/10.1023/A:1007678930559.pdf).
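For comparison with the Q-learning sketch above, here is a minimal SARSA sketch with a decaying $\epsilon$; the $1/k$ schedule and the environment interface are again just assumptions for illustration:

```python
import numpy as np

def sarsa(env, n_episodes, alpha=0.1, gamma=0.99):
    # Same assumed environment interface as in the Q-learning sketch above.
    Q = np.zeros((env.n_states, env.n_actions))

    def act(s, epsilon):
        # Epsilon-greedy action selection with respect to the current Q.
        if np.random.rand() < epsilon:
            return np.random.randint(env.n_actions)
        return int(np.argmax(Q[s]))

    for episode in range(n_episodes):
        # Epsilon decays towards 0, so the policy becomes greedy in the limit.
        epsilon = 1.0 / (episode + 1)
        s, done = env.reset(), False
        a = act(s, epsilon)
        while not done:
            s_next, r, done = env.step(a)
            a_next = act(s_next, epsilon)
            # On-policy target: bootstrap from the action actually selected next.
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] * (not done) - Q[s, a])
            s, a = s_next, a_next
    return Q
```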

So, even with SARSA, if you want to behave optimally, you can't just arbitrarily choose any stochastic policy derived from the optimal value function that was found (note also that SARSA is on-policy). However, SARSA can also find an _optimal restricted policy_, i.e. a policy that is optimal within a restricted class of stochastic policies (such as rank-based randomized policies). See Theorem 2 (p. 297) of [this paper](https://link.springer.com/content/pdf/10.1023/A:1007678930559.pdf) for more details.

To answer your question directly: Q-learning can find an optimal stochastic policy, provided that one exists, i.e. provided that in some state two or more actions share the same optimal value.

nbro
  • So if I understand the post correctly, even in SARSA we decrease exploration (by decreasing $\epsilon$ in the case of $\epsilon$-greedy) as time progresses, to ensure that we obtain the optimal value function? – ketan dhanuka May 25 '22 at 18:40
  • @ketandhanuka Yes, as far as I remember, $\epsilon$ must eventually decay in order for SARSA to find the optimal value function. – nbro May 25 '22 at 18:41
  • $\epsilon$-greedy policies with a fixed $\epsilon$ can be ranked. So SARSA will converge on the optimal $\epsilon$-greedy policy (i.e. the optimal policy *given that* a specific $\epsilon$ value applies) even without decaying $\epsilon$. Sometimes this is desirable, for instance in online scenarios that never complete learning. – Neil Slater May 26 '22 at 08:01
  • @NeilSlater I updated my answer to improve the precision of my explanations. SARSA converges to the optimal value function with decaying behaviour policies, but it can also converge to a policy called (in [this paper](https://link.springer.com/content/pdf/10.1023/A:1007678930559.pdf)) an _optimal restricted policy_, which is the optimal policy that chooses actions according to their rank (as far as I understand it). I am not sure if it's correct to say that SARSA converges to an optimal $\epsilon$-greedy policy, though: $\epsilon$-greedy policies don't choose actions according to their rank. – nbro May 26 '22 at 09:19
  • Maybe you're referring to another result about $\epsilon$-greedy policies in the context of SARSA. But, in [this paper](https://link.springer.com/content/pdf/10.1023/A:1007678930559.pdf), what you say only applies to restricted rank-based randomized (RRR) policies, i.e. policies that choose actions according to their ranks. – nbro May 26 '22 at 09:32
  • Yes I meant the restricted policy. By "ranked" I meant that the policies can be compared, so optimality is well defined. I used the wrong terminology, sorry. – Neil Slater May 26 '22 at 11:11