
I was looking at the following diagram from Sutton & Barto's cliff-walking example:

http://incompleteideas.net/book/ebook/figtmp32.png

The reward obtained with SARSA is higher. However, the path that Q-learning chooses is eventually the optimal one, isn't it? Why is the SARSA reward higher if it is not choosing the best path? Shouldn't the best path and the safe path both be the optimal one, since that is the path with the higher reward?


2 Answers


It is important to note that the graph shows the reward received during training. This includes the rewards incurred by exploratory moves, which sometimes involve the agent falling off the cliff, even when it has already learned that doing so leads to a large penalty. Q-learning does this more often than SARSA because Q-learning targets learning the values of the optimal greedy policy, whilst SARSA targets learning the values of the approximately optimal $\epsilon$-greedy policy it is actually following. The cliff walking setup is designed to make these two policies differ.
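As a rough illustration of that difference in learning targets, here is a minimal tabular sketch (not from the book; the `Q` array layout, `alpha`, `gamma` and the helper `epsilon_greedy` are just illustrative names). Both agents behave $\epsilon$-greedily, but Q-learning bootstraps its update from the greedy action, while SARSA bootstraps from the action it actually takes next:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Target uses the max over next actions: learns values of the greedy policy,
    # regardless of which (possibly exploratory) action is taken next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # Target uses the action actually selected next by the epsilon-greedy policy:
    # learns values of the epsilon-greedy policy being followed.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```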

The graph shows that during training, SARSA performs better at the task than Q-learning. This may be an important consideration if mistakes during training have a real cost (e.g. someone has to keep picking the robot up off the floor whenever it falls off the cliff).

If you stopped training after episode 500 (assuming both agents had converged to accurate enough action-value tables by that point) and ran both agents with the greedy policy based on their action values, then Q-learning would score -13 per episode and SARSA would do worse at -17 per episode. Both would perform better than during training, but Q-learning would have the better trained policy.
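For reference, those per-episode scores follow from the cliff-walking reward of $-1$ per step: the optimal path along the cliff edge takes 13 steps and the safe path takes 17, so the greedy returns are

$$13 \times (-1) = -13 \qquad \text{and} \qquad 17 \times (-1) = -17,$$

respectively.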

To make SARSA and Q-learning equivalent in the long term, you would need to decay the exploration parameter $\epsilon$. If you did this during training, at a slow enough rate and ending with no exploration, then the two approaches would converge to the same optimal policy and the same reward per episode (of -13).
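For instance, a simple decay schedule might look like the sketch below; the episode count and the linear schedule are arbitrary illustrative choices, not something prescribed by the book:

```python
num_episodes = 500
eps_start, eps_end = 0.1, 0.0  # start with the usual exploration, end fully greedy

for episode in range(num_episodes):
    # Linearly anneal epsilon from eps_start down to eps_end over training.
    epsilon = eps_start + (eps_end - eps_start) * episode / (num_episodes - 1)
    # ... run one episode, selecting actions epsilon-greedily with this epsilon ...
```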

– Neil Slater

Adding to Neil's reply: although the path shown is optimal, following this so-called 'optimal path' will often result in sub-optimal returns, because action selection in this problem is stochastic due to the $\epsilon$-greedy exploration. That is, even if the agent is in a cell right above the cliff and knows that the best action is to move right, that action is selected only with probability roughly $1-\epsilon$ (about 0.9 for a typical $\epsilon = 0.1$). This occasionally causes the agent to deviate from the optimal path and fall into the cliff, reducing its return.
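To make that figure concrete, under the common convention that the exploratory branch samples uniformly over all four actions, the greedy action is taken with probability

$$\Pr[\text{greedy action}] = (1-\epsilon) + \frac{\epsilon}{|\mathcal{A}|} = 0.9 + \frac{0.1}{4} = 0.925 \quad (\epsilon = 0.1,\ |\mathcal{A}| = 4),$$

which is roughly the 0.9 quoted above (and exactly 0.9 if exploratory moves are restricted to non-greedy actions).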

As the update rules below make explicit, this effect is captured by the SARSA algorithm because it is on-policy: its update bootstraps from the action that was actually taken. In Q-learning, however, the update bootstraps from the action with the maximal Q-value rather than the action that was taken, so the sub-optimality induced by $\epsilon$-greedy action selection is not captured in Q-learning's value table.
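For reference, the standard tabular updates (as given in Sutton & Barto) are:

$$\text{SARSA:} \quad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$$

$$\text{Q-learning:} \quad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right]$$

where $A_{t+1}$ in the SARSA update is the action actually chosen by the $\epsilon$-greedy behaviour policy in state $S_{t+1}$.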

$\epsilon$-greedy action selection is needed only during training, to ensure sufficient exploration. Unless the value of $\epsilon$ is very small, the expected reward achieved by SARSA will be higher. That said, as you can imagine, in Q-learning there will be a few rollouts where the stochasticity works in the agent's favour and it does successfully move along the optimal path, achieving the maximum cumulative reward. Nevertheless, the expected reward over many rollouts will be lower than that of SARSA.


References:

  1. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, 2nd edition, MIT Press, 2018.
    The stochastic action selection is not part of the problem definition (or at least not of the environment specification), but is a feature of the agents being compared. This is an important distinction, and saying whether Q-learning or SARSA is "better" at this task depends on context outside of the optimal control problem definition. – Neil Slater Jan 06 '22 at 16:40