I have some difficulty understanding the difference between Q-learning and SARSA. Here (What are the differences between SARSA and Q-learning?) the following update formulas are given:

Q-Learning

$$Q(s,a) = Q(s,a) + \alpha (R_{t+1} + \gamma \max_{a'} Q(s',a') - Q(s,a))$$

SARSA

$$Q(s,a) = Q(s,a) + \alpha (R_{t+1} + \gamma Q(s',a') - Q(s,a))$$

I know that SARSA is an on-policy method while Q-learning is off-policy. So, in Q-learning, the epsilon-greedy policy (or epsilon-soft or softmax policy) is used for selecting actions, while the greedy policy is used in the update of the Q-values. In SARSA, the same epsilon-greedy (or epsilon-soft or softmax) policy is used both for selecting actions and in the update of the Q-function.
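To make sure I am reading the formulas correctly, here is a tiny numeric sketch (Python, made-up numbers and variable names of my own) contrasting the two update targets for the same next state s':

```python
import numpy as np

# Hypothetical Q(s', .) values for a next state s' with 3 actions (made-up numbers).
Q_next = np.array([1.0, 5.0, 2.0])
r, gamma = 0.5, 0.9
a_prime = 2  # the action the epsilon-greedy policy happened to sample in s'

q_learning_target = r + gamma * Q_next.max()     # greedy max over the next actions
sarsa_target      = r + gamma * Q_next[a_prime]  # value of the action actually sampled

print(q_learning_target, sarsa_target)  # 5.0 vs. 2.3
```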

So, actually, I have a question on that:

On this website (https://www.cse.unsw.edu.au/~cs9417ml/RL1/algorithms.html) the following is written about SARSA:

As you can see, there are two action selection steps needed, for determining the next state-action pair along with the first.

What is meant by the two action selections? Normally, you can only select one action per iteration, so I suppose the other "selection" is for the update.

1 Answer

In my view, the best way to understand these algorithms is to read the pseudocode (multiple times, if necessary!).

Here's the pseudocode of Q-learning.

[image: tabular Q-learning pseudocode]
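In code, the same loop structure looks roughly like this (a minimal sketch, assuming a Gymnasium-style environment with `reset()`/`step()`, a discrete action space, and hashable observations; the helper names are mine):

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning (off-policy TD control)."""
    Q = defaultdict(lambda: [0.0] * env.action_space.n)

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.randrange(env.action_space.n)
        return max(range(env.action_space.n), key=lambda a: Q[s][a])

    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)  # the only action selection per step
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # The update target uses the greedy max over the next actions,
            # not an action sampled from the behaviour policy.
            target = r + gamma * max(Q[s_next]) * (not terminated)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```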

Here's the pseudocode of SARSA.

[image: tabular SARSA pseudocode]
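And the corresponding SARSA loop, as a sketch under the same assumptions (note the two places where an action is selected):

```python
import random
from collections import defaultdict

def sarsa(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA (on-policy TD control)."""
    Q = defaultdict(lambda: [0.0] * env.action_space.n)

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.randrange(env.action_space.n)
        return max(range(env.action_space.n), key=lambda a: Q[s][a])

    for _ in range(num_episodes):
        s, _ = env.reset()
        a = epsilon_greedy(s)  # 1st selection: at the start of the episode
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # 2nd selection: the next action a', sampled with the SAME policy;
            # it is used in the update AND taken at the next step.
            a_next = epsilon_greedy(s_next)
            target = r + gamma * Q[s_next][a_next] * (not terminated)
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q
```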

So, as you can see, in SARSA, we choose one action at the start of each episode (before the first step), and, during the episode, we choose (and take) further actions at every step. In both cases, we choose these actions with the same policy (e.g. $\epsilon$-greedy), which is derived from $Q$. In Q-learning, we do not choose an action at the start of the episode; we only choose and take an action at each step of the episode (as in SARSA). Hence, in SARSA, we select actions in two places in the pseudocode, even though we only take one action at each step of the episode. Note the difference between choosing/selecting an action and taking an action in the environment: you may choose an action just to update the Q-function, i.e. without (yet) taking it in the environment.

  • Thanks for your answer, nbro. Unfortunately, I have to admit that I still have problems understanding it, and I have several follow-up questions. 1) Why are we choosing more than one action in SARSA? Is one for going into the next state and the other for updating the Q-function? 2) In SARSA, do we only select one action at the very beginning and then always choose the same action for each step? Does it really make sense to choose the same initially chosen action a regardless of the state s? – PeterBe Dec 14 '21 at 15:00
  • 3) Are the two policies in SARSA for choosing an action the same? I guess yes, because it is called an on-policy learning algorithm. 4) How can we calculate Q(s', a') in both SARSA and Q-learning for updating the Q-function? After having taken an action a in state s, we get the reward r, which we can observe. But we cannot observe Q(s', a') from the environment, as far as I can see. – PeterBe Dec 14 '21 at 15:00
  • @PeterBe Create **one post for each of these specific questions** and I or someone else will answer them. Please, really only one question per post, and provide the context for each question in its own post. – nbro Dec 14 '21 at 16:28
  • Thanks for your answer and effort. I understand that, generally, each question should be asked in a separate post. However, at least my first 2 questions are direct follow-up questions to your answer. It does not make sense to ask questions 1) and 2) in a separate post, because I would either be asking exactly the same thing as in this post (question 1) or reacting directly to something you mentioned in your answer (question 2). – PeterBe Dec 14 '21 at 16:32
  • @PeterBe In this answer, I am answering the question "What is meant by two action selections?" in the context of SARSA. This is actually different from explaining **why** we have 2 action selections in SARSA. Whether it's a follow-up question or not doesn't matter. Your questions could be useful for future readers; that's why I am asking you to ask each of your specific questions in its own post. More people might have those same questions (but not exactly the same combination of questions: that's why it's important to ask each specific question in a separate post). – nbro Dec 14 '21 at 16:38
  • Thanks for your comment. Actually, my first 2 questions were also targeting "what" the 2 actions are. I still don't understand what they are. As asked in my question 1): is one action for going into the next state and the other one for updating the Q-function? And, as asked in question 2): what is the first action really about? Is it one action that will not change throughout all episodes? What is the first action doing and what is the second action doing? I have problems understanding this from your answer. This is why I think that at least those 2 questions are directly related. – PeterBe Dec 14 '21 at 16:51
  • @PeterBe I will answer all your questions, if you just ask them in a separate post. – nbro Dec 14 '21 at 16:59