
The following definitions are from Sutton and Barto's book Reinforcement Learning (Adaptive Computation and Machine Learning series):

[Images: the algorithm boxes (pseudocode) for Sarsa (on-policy TD control) and Q-learning (off-policy TD control) from the book]

To aid my learning of RL and gain an intuition, I'm focusing on the differences between some algorithms. I've selected Sarsa (on-policy TD control) for estimating $Q \approx q_*$ and Q-learning (off-policy TD control) for estimating $\pi \approx \pi_*$.

For conciseness I'll refer to Sarsa (on-policy TD control) for estimating $Q \approx q_*$ and Q-learning (off-policy TD control) for estimating $\pi \approx \pi_*$ as Sarsa and Q-learning respectively.

Are my following assertions correct?

The primary difference is how the Q-values are updated.

Sarsa Q-value update: $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]$

Q-learning Q-value update: $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_a Q(S', a) - Q(S, A)]$
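To make the comparison concrete, here is a minimal sketch of the two updates for a tabular $Q$ stored as a dict of dicts; the names and function signatures are illustrative, not taken from the book's pseudocode:

```python
# Minimal sketch of the two TD updates for a tabular Q stored as a
# dict of dicts: Q[state][action] -> float. Names are illustrative.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # Sarsa: the target uses the action a_next that the behaviour policy
    # actually chose in s_next.
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Q-learning: the target uses the maximum Q-value over the actions
    # available in s_next, whatever action is actually taken there next.
    td_target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])
```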

Sarsa, in performing the TD update, uses the discounted Q-value of the next state-action pair $S', A'$ in its target and subtracts the Q-value of the current state-action pair $S, A$. Q-learning, on the other hand, uses the discounted maximum action value over the next state, $\max_a Q(S', a)$. Within the Q-learning episode loop the value of $a$ is never assigned, so is an update made to $a$ during Q-learning?

In Sarsa, unlike Q-learning, the next action $A'$ is assigned to the current action $A$ at the end of each episode step ($A \leftarrow A'$); Q-learning makes no such assignment.

Sarsa, unlike Q-learning, does not include the arg max as part of the Q-value update.

In choosing the initial action for each episode, Sarsa and Q-learning both use a "policy derived from Q"; as an example, the epsilon-greedy policy is given in the algorithm definition. But could any policy be used here instead of epsilon-greedy? Q-learning does not use the next state-action pair in performing the TD update; it just uses the next state and the current action, given in the algorithm definition as $Q(S', a)$. What is $a$ in this case?

  • [here](https://ai.stackexchange.com/questions/21044/are-q-learning-and-sarsa-the-same-when-action-selection-is-greedy/21070#21070) is an answer that shows when the two are (almost) the same. – David Jun 15 '20 at 19:13
  • Hi. Please, next time, ask one question per post. Here you're asking too many questions. When you have multiple questions, ask each of them in its separate post, even though they are related. This helps answerers to focus on a single problem, so it facilitates the life of everyone that reads the posts. Please, read [our on-topic page](https://ai.stackexchange.com/help/on-topic) to understand the ideal question we are looking for. – nbro Jun 18 '20 at 13:32

1 Answer


The main difference between the two is that Q-learning is an off-policy algorithm. That is, we learn about a policy that is different from the one we use to choose actions. To see this, let's look at the update rules.

Q-Learning

$$Q(s,a) \leftarrow Q(s,a) + \alpha (R_{t+1} + \gamma \max_a Q(s',a) - Q(s,a))$$

SARSA

$$Q(s,a) \leftarrow Q(s,a) + \alpha (R_{t+1} + \gamma Q(s',a') - Q(s,a))$$

In SARSA we choose our $a'$ according to what our policy tells us to do when we are in state $s'$, so the policy we are learning about is also the policy we use to choose our actions.

In Q-learning, we learn about the greedy policy whilst following some other policy, such as $\epsilon$-greedy. This is because when we transition into state $s'$ our TD-target becomes the maximum Q-value for whichever state we end up in, $s'$, where the max is taken over the actions.

Once we have actually updated our Q-function and need to choose an action to take in $s'$, we do so using the policy that generates our actions. Thus we are learning about the greedy policy whilst following some other policy, hence off-policy. In SARSA, when we move into $s'$, the TD target is given by the Q-value of the state we transition into together with the action we would then choose based on our policy.
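To illustrate, here is a rough sketch of one tabular Q-learning episode (a matching SARSA sketch appears further down); `env`, `epsilon_greedy`, and the dict-of-dicts `Q` are hypothetical stand-ins, not the book's pseudocode:

```python
def run_q_learning_episode(env, Q, alpha, gamma, epsilon_greedy):
    """One episode of tabular Q-learning (sketch). `env` is a hypothetical
    environment with reset()/step(), `Q` maps state -> {action: value},
    and `epsilon_greedy` is whatever behaviour policy we follow."""
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s)              # behaviour policy picks the action we execute
        s_next, r, done = env.step(a)
        # The target is greedy in s_next: max over actions, independent of
        # which action the behaviour policy will actually take there.
        target = r + (0.0 if done else gamma * max(Q[s_next].values()))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next                            # note: no "a = a_next" assignment here
```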

Within the Q-learning episode loop the value of $a$ is never assigned, so is an update made to $a$ during Q-learning?

It will be, because the policy we use to choose our actions is one that explores sufficiently around all of the state-action pairs, so each pair is guaranteed to be encountered (and have its Q-value updated) at some point.

Sarsa, unlike Q-learning, does not include the arg max as part of the Q-value update.

It is not an $\arg \max$, it is a $\max$. The $\max$ operator returns the largest value the function attains, $$\max_x f(x) = f(x^*) \quad \text{where } f(y) \leq f(x^*) \text{ for all } y,$$ whereas $\arg \max_x f(x)$ returns the maximising input $x^*$ itself.
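For example, with the action values of some next state stored in a dict (made-up, illustrative numbers):

```python
# Q-values of the actions available in some next state s'
q_values = {"left": 0.3, "right": 0.9, "stay": 0.5}

best_value = max(q_values.values())            # max     -> 0.9, used in Q-learning's target
best_action = max(q_values, key=q_values.get)  # arg max -> "right", used when acting greedily
```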

In Sarsa, unlike Q-learning, the next action $A'$ is assigned to the current action $A$ at the end of each episode step ($A \leftarrow A'$); Q-learning makes no such assignment.

Kind of - the action that you chose for your TD target in SARSA becomes the action that you consider in the next step of the episode. This is natural because, essentially, you are in state $s$, you take action $a$ and observe a new state $s'$; at that point you use your policy to see which action you will take in $s'$, call this $a'$, then perform the SARSA update, and then execute that action in the environment.
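A matching SARSA sketch (same hypothetical `env`, `Q`, and `epsilon_greedy` as in the Q-learning sketch above) shows exactly where $a'$ becomes the next $a$:

```python
def run_sarsa_episode(env, Q, alpha, gamma, epsilon_greedy):
    """One episode of tabular SARSA (sketch), with the same hypothetical
    arguments as the Q-learning sketch above."""
    s = env.reset()
    a = epsilon_greedy(Q, s)                  # initial action comes from the policy
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next)    # a' chosen by the same (behaviour) policy
        target = r + (0.0 if done else gamma * Q[s_next][a_next])
        Q[s][a] += alpha * (target - Q[s][a]) # the update uses Q(s', a')
        s, a = s_next, a_next                 # a <- a': a' is the action executed next
```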

In choosing the initial action for each episode, Sarsa and Q-learning both use a "policy derived from Q"; as an example, the epsilon-greedy policy is given in the algorithm definition. But could any policy be used here instead of epsilon-greedy?

Yes, any policy can be used although you want to choose a policy that allows sufficient exploration of the state-space.
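For instance, the `epsilon_greedy` helper assumed in the sketches above could be implemented roughly like this (one of many sufficiently exploratory choices):

```python
import random

def epsilon_greedy(Q, s, epsilon=0.1):
    # With probability epsilon take a random action (exploration),
    # otherwise take the greedy action for state s (exploitation).
    if random.random() < epsilon:
        return random.choice(list(Q[s]))
    return max(Q[s], key=Q[s].get)
```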

Q-learning does not use the next state-action pair in performing the TD update; it just uses the next state and the current action, given in the algorithm definition as $Q(S', a)$. What is $a$ in this case?

In the algorithm it is actually $\max_a Q(S',a)$, which, if you refer back to my earlier definition of what the $\max$ operator does, should answer this question.

  • "It is not an argmax, it is a max." thanks for pointing this out, I mixed up taking the arg max in the function epsilon greedy with the calculation of the updated Sarsa Q value . – blue-sky Jun 16 '20 at 13:07
  • When you say "In SARSA we chose our $a'$ according to what our policy tells us to do when we are in state $s'$, so the policy we are learning about is also the policy that we choose to make our actions.". It's important to note that this is definitely **not** clear only from the update rule of SARSA, because the update rule in SARSA doesn't tell you anything about how the actions $a$ and $a'$ are chosen from the corresponding states. You need to explicitly say it. In Q-learning, this is less of a problem, one can understand that the max "represents the target policy". – nbro Nov 09 '20 at 13:49
  • What you need to clarify here (which you try to do) is that the behavior policy, such as $\epsilon$-greedy, is used to pick up the tuples $(s, a)$ for which you update $Q(s, a)$ in both the Q-learning and SARSA. The difference is that in SARSA you also use $\epsilon$-greedy (or any other behaviour policy that you used before to select $a$ from $s$) to take action $a'$ from $s'$ (as opposed to just select the greedy action, as in Q-learning). – nbro Nov 09 '20 at 13:49
  • In my opinion, the best way to understand the difference between SARSA and Q-learning is really to read the [pseudocode of both algorithms](https://www.cse.unsw.edu.au/~cs9417ml/RL1/algorithms.html) (actually provided in the question), and not just the update rules, which are not self-explanatory, as I am trying to say. – nbro Nov 09 '20 at 13:54