
The update equation for SARSA is $Q(S,A) \leftarrow Q(S,A) + \alpha\left[R + \gamma Q(S',A') - Q(S,A)\right]$. Consider this: I take an action $A$ that leads to the terminal state. Now my $S'$ would be one of the terminal states. So...

  1. Intuitively, how does it make sense to take an action $A'$ when the environment already ended? Or is this something you just do anyway?

  2. Once a terminal state-action pair is reached, you update the previous state-action pair and then start the episode loop all over again. But this means that the terminal state-action pair ($Q(S',A')$ in my example) is never updated. So, if your initial estimate of $Q(S',A')$ was wrong, you would never be able to fix it, which would be very problematic. (And you can't set all the terminal values to zero, because you are using function approximators.)

So, how do I resolve these issues?


1 Answer

  1. Intuitively, how does it make sense to take an action $A'$ when the environment already ended?

It doesn't make sense, in that nothing can happen once the agent reaches a terminal state. However, it is often modelled as an "absorbing state" where the action is unimportant (either null or its value ignored) and whose value is $0$ by definition.

And you can't set all the terminal values to zero because you are using function approximators

The value is zero by definition. There is no need to approximate it. So don't use function approximators for action values in terminal states. When $S'$ is terminal, the update becomes:

$Q(S,A) \leftarrow Q(S,A) + \alpha(R - Q(S,A))$

Look at any implementation of Q-learning or SARSA and you will see a conditional calculation of the update target that uses some variant of the above logic when $S'$ is terminal. In OpenAI Gym environments, for instance, it will use the `done` flag.
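For concreteness, here is a minimal tabular SARSA sketch (not from the original answer) showing that conditional, assuming the classic Gym API where `env.step` returns `(observation, reward, done, info)` and observations are integer state indices; the helper `epsilon_greedy` and the parameters `n_states`/`n_actions` are illustrative assumptions:

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q-values only for non-terminal states; terminal values are never stored
    # or approximated, they are treated as 0 by construction.
    Q = np.zeros((n_states, n_actions))

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            if done:
                # Terminal case: no next action A' is taken and no bootstrapping,
                # so the target is just R (equivalently, Q(S', A') = 0).
                target = r
            else:
                a_next = epsilon_greedy(s_next)
                target = r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            if not done:
                s, a = s_next, a_next
    return Q
```

The `if done:` branch is exactly the update $Q(S,A) \leftarrow Q(S,A) + \alpha(R - Q(S,A))$ from above; the same conditional carries over unchanged when the table is replaced by a function approximator.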

Neil Slater