
I recently started learning about reinforcement learning. Currently, I am trying to implement the SARSA algorithm. However, I do not know how to deal with $Q(s', a')$, when $s'$ is the terminal state. First, there is no action to choose from in this state. Second, this $Q$-factor will never be updated either because the episode ends when $s'$ is reached. Should I initialize $Q(s', a')$ to something other than a random number? Or should I just ignore the $Q$-factors and simply feed the reward $r$ into the update?

Hai Nguyen

2 Answers


The value $Q(s', ~\cdot~)$ should always be implemented as simply $0$ for any terminal state $s'$ (the dot in place of an action as the second argument indicates that this holds for every action, as long as $s'$ is terminal).

It is easier to understand why this should be the case by dissecting what the different terms in the update rule mean:

$$Q(s, a) \gets \color{red}{Q(s, a)} + \alpha \left[ \color{blue}{r + \gamma Q(s', a')} - \color{red}{Q(s, a)} \right]$$

In this update, the red term $\color{red}{Q(s, a)}$ (which appears twice) is our old estimate of the value $Q(s, a)$ of being in state $s$ and executing action $a$. The blue term $\color{blue}{r + \gamma Q(s', a')}$ is a different estimate of the same quantity $Q(s, a)$. This second estimate is assumed to be slightly more accurate, because it is not "just" a prediction, but a combination of:

  • something that we really observed: $r$, plus
  • a prediction: $\gamma Q(s', a')$

Here, the $r$ component is the immediate reward that we observed after executing $a$ in $s$, and then $Q(s', a')$ is everything we expect to still be collecting afterwards (i.e., after executing $a$ in $s$ and transitioning to $s'$).

Now, suppose that $s'$ is a terminal state: what rewards do we still expect to collect in the future within that same episode? Since $s'$ is terminal and the episode has ended, there can only be one correct answer: we expect to collect exactly $0$ reward in the future.
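In code, a common way to realise this is to zero out the bootstrap term whenever the successor state is terminal. Here is a minimal sketch, assuming a tabular $Q$ indexed by (state, action) and a boolean `done` flag reported by the environment (both of these are assumptions for illustration, not part of the pseudocode you linked):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """One tabular SARSA update; `done` signals that s_next is terminal."""
    # For a terminal s', Q(s', .) is taken to be 0, so the bootstrap
    # term vanishes and only the observed reward r is backed up.
    bootstrap = 0.0 if done else gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (r + bootstrap - Q[s, a])
```

With this convention, the entries of $Q$ for terminal states are never read or updated, so it does not matter how (or whether) they are initialised.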

Dennis Soemers

The description of the algorithm you linked to says to 'repeat until $s$ is terminal', so one would end the episode at that point, and your intuition holds.

Practically, if one were implementing a reward function in which a specific reward is associated with the end of the episode, such as $r(\text{robot ran into a wall}) = -100$, then one can imagine a terminal state just after this 'wall hit' state, so that the agent still sees this reward. The episode would then be in a terminal state and would end.
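To make that concrete, here is a minimal sketch of such an episode loop. The environment interface (`env.reset()` and `env.step()` returning a `done` flag) and the $\epsilon$-greedy helper are illustrative assumptions on my part:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def run_episode(env, Q, epsilon=0.1, alpha=0.1, gamma=0.99):
    """One SARSA episode: repeat until the environment reports a terminal state."""
    s = env.reset()
    a = epsilon_greedy(Q, s, epsilon)
    done = False
    while not done:
        # The environment hands out the end-of-episode reward itself,
        # e.g. r = -100 when the robot runs into a wall, and sets done=True.
        s_next, r, done = env.step(a)
        a_next = None if done else epsilon_greedy(Q, s_next, epsilon)
        bootstrap = 0.0 if done else gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (r + bootstrap - Q[s, a])
        s, a = s_next, a_next
```

Because the update that backs up the $-100$ happens before the loop exits, the agent does see the terminal reward, exactly as described above.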

Jaden Travnik