
I would much appreciate it if you could point me in the right direction regarding this question about the targets for SARSA and Q-learning (notation: $S$ is the current state, $A$ is the current action, $R$ is the reward, $S'$ is the next state, and $A'$ is the action chosen from that next state).

Do we need an explicit policy for the Q-learning target to sample $A'$ from? And for SARSA?

I guess this is true for Q-learning, since we need the max Q-value over the next state's actions, which determines which action $A'$ we'll use for the update. For SARSA, we update $Q(S, A)$ based on the action that was actually taken (no max needed). Please correct me if I'm wrong.
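For concreteness, the one-step targets I have in mind (with discount factor $\gamma$) are, as far as I understand,

$$\underbrace{R + \gamma \max_{a'} Q(S', a')}_{\text{Q-learning}} \qquad \text{vs.} \qquad \underbrace{R + \gamma Q(S', A')}_{\text{SARSA}}.$$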


1 Answer


Q-learning uses an exploratory policy, derived from the current estimate of the $Q$ function, such as the $\epsilon$-greedy policy, to select the action $a$ from the current state $s$. After having taken this action $a$ from $s$, the reward $r$ and the next state $s'$ are observed. At this point, to update the estimate of the $Q$ function, you use a target that assumes that the greedy action is taken from the next state $s'$. The greedy action is selected by the $\operatorname{max}$ operator, which can thus be thought of as an implicit policy (but this terminology isn't common, AFAIK), so, in this context, the greedy action is the action associated with the highest $Q$ value for the state $s'$.
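To make this concrete, here is a minimal sketch of a single tabular Q-learning update (not code from the question; `Q`, `alpha`, and `gamma` are illustrative names, with `Q` assumed to be a float NumPy array of shape `(num_states, num_actions)`):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update.

    The target uses the max over Q[s_next, :], i.e. it assumes the greedy
    action is taken from s_next, even though the behaviour policy
    (e.g. epsilon-greedy) may actually pick a different action there.
    """
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```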

In SARSA, no $\operatorname{max}$ operator is used, and you derive a policy (e.g. the $\epsilon$-greedy policy) from the current estimate of the $Q$ function to select both $a$ (from $s$) and $a'$ (from $s'$).
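For comparison, a sketch of the corresponding SARSA update under the same assumptions (`epsilon_greedy` is a hypothetical helper, not something from the answer above): here $a'$ is actually sampled from the $\epsilon$-greedy policy, and that sampled action appears in the target:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """Sample an action from the epsilon-greedy policy derived from Q."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])  # explore: uniformly random action
    return int(np.argmax(Q[s]))               # exploit: greedy action

def sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One tabular SARSA update: a' is sampled from the same epsilon-greedy
    policy that acts in the environment, and that a' (not necessarily the
    greedy action) appears in the target."""
    a_next = epsilon_greedy(Q, s_next, epsilon)
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q, a_next  # a_next is also the action executed from s_next
```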

To conclude, in all cases, the policies are implicit, in the sense that they are derived from the estimate of the $Q$ function, but this isn't common terminology. See also this answer, where I describe in more detail the differences between Q-learning and SARSA, and I also show the pseudocode of both algorithms, which you should read (multiple times) in order to fully understand their differences.

  • Thank you for your answer, I also checked the link you provided and it was very helpful. But I'm still not sure what it means to "sample `A'` from a policy". Does it mean that for Q-learning I do need an explicit policy (`max Q`, greedy), because I know which `A'` I will use to compute the target (the one that gives me the `max Q`, regardless of the action actually taken in the environment)? And for SARSA, I pick the action with some probability and compute the target based on that `A'` (the one actually taken in the environment)? Could you elaborate on that please? – Novak Jan 29 '20 at 09:36
  • @Novak To sample $a'$ from a policy is the same thing as to sample $a$ from the same or another policy. If $\pi(a \mid s)$ represents a distribution, then $a \sim \pi(a \mid s)$ is a sample from that distribution (in the statistical sense). In the case of Q-learning, you **assume** that the agent will take the best action from $s'$. You don't necessarily know the specific action $a'$, but, in Q-learning, $a'$ will be the best action (by assumption, because you use the max). The only action taken is $a$. See the pseudocode in my [other answer](https://ai.stackexchange.com/a/17626/2444). – nbro Jan 29 '20 at 11:44
  • @Novak See also this question [What is the relation between a policy which is the solution to an MDP and a policy like $\epsilon$-greedy?](https://ai.stackexchange.com/q/10492/2444) that I had asked. (Btw, use the `$` symbols for MathJax notation.) – nbro Jan 29 '20 at 11:45