> I'm using OpenAI's cartpole environment. First of all, is this environment not Markov?
The OpenAI Gym CartPole environment is Markov. Whether or not you know the transition probabilities does not affect whether the state has the Markov property. All that matters is that knowing the current state is enough, in principle, to determine the next state and reward. You do not need to explicitly know the state transition model or the reward function.
An example of a non-Markov state for CartPole would be if one of the features was missing - e.g. the current position of the cart, or the angular velocity of the pole. It is still possible to have agents attempt to solve CartPole with such missing data, but the state would no longer be Markov, and it would be a harder challenge.
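For instance, here is a minimal sketch of how you could build such a non-Markov variant yourself, assuming the `gym` package (the wrapper name `DropAngularVelocity` is just an illustrative choice of mine):

```python
import gym
import numpy as np

class DropAngularVelocity(gym.ObservationWrapper):
    """Hide the pole's angular velocity from the agent."""

    def __init__(self, env):
        super().__init__(env)
        # Full CartPole observation: [cart position, cart velocity,
        # pole angle, pole angular velocity]. Keep only the first three.
        low, high = env.observation_space.low[:3], env.observation_space.high[:3]
        self.observation_space = gym.spaces.Box(low, high, dtype=np.float32)

    def observation(self, obs):
        # Two states that agree on position/velocity/angle but differ in
        # angular velocity now look identical to the agent, so the
        # observation is no longer Markov.
        return obs[:3]

env = DropAngularVelocity(gym.make("CartPole-v1"))
```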
> For me, there is something weird in updating a Q value based on the max Q for a state and a reward value that was not from the action taken. How does this make learning better and make you learn the optimal policy?
To recap, you are referring to this update equation, or perhaps some variation of it:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha(R_{t+1} + \gamma\text{max}_{a'}Q(S_{t+1},a') - Q(S_t, A_t))$$
Another correction here: the reward value does come from the action taken and the experience that followed. This is important - the values of $R_{t+1}$ and $S_{t+1}$ come from the experience observed after taking action $A_t$; they do not come from anywhere else. In fact, this is the only data from experience that is inserted into the update; everything else is based on action value estimates in a somewhat self-referential way (a.k.a. "bootstrapping").
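Written out as code, the update consumes only the observed transition $(S_t, A_t, R_{t+1}, S_{t+1})$ plus the current table of estimates. Here is a minimal sketch (the function name, the NumPy array for the Q table, and the default hyperparameters are just illustrative choices):

```python
import numpy as np

def q_learning_update(Q, s, a, reward, s_next, done, alpha=0.1, gamma=0.99):
    """Single-step Q-learning update.

    Q        : 2-D array of action value estimates, Q[state, action]
    s, a     : the state S_t and the action A_t actually taken
    reward   : the observed reward R_{t+1}
    s_next   : the observed next state S_{t+1}
    done     : True if S_{t+1} is terminal (no future value to bootstrap)
    """
    # Bootstrapped estimate of the future under the greedy target policy.
    future = 0.0 if done else np.max(Q[s_next])
    # The TD target uses only the experienced reward plus the bootstrap.
    td_target = reward + gamma * future
    # Move the current estimate a fraction alpha towards the target.
    Q[s, a] += alpha * (td_target - Q[s, a])
```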
How does this converge to an optimal policy? It follows a process known as generalised policy iteration (GPI), which works as follows:
- The target policy is to always take the best action according to your current action value estimates.
- In off-policy methods only (i.e. this does not apply to all GPI), a conversion is needed from experience under the behaviour policy to value estimates for the target policy. In single-step Q learning, this is where the $\text{max}_{a'}Q(S_{t+1},a')$ term comes from: it calculates the future value of following the target policy*. No interim values are needed, because you are estimating a new value for $Q(S_t,A_t)$ only; there are no other time steps where you need to care about the difference between the behaviour and target policies (there is just "now" and "all of the future", which is bootstrapped).
- Once you update the action value estimates in the Q table, that automatically updates the target policy, because it always takes the best estimated action.
- Each update improves your value estimates, because some real experienced data, in the form of $R_{t+1}$ and $S_{t+1}$, is injected into them.
- Each time the estimates improve, the target policy can also improve and become closer to the optimal policy.
- When the target policy changes, the estimates become less accurate, because they were made for the previous policy. However, the next time the same states and actions are visited, they will be updated based on the new, improved policy.
With a stochastic environment, it is possible for estimates to get worse due to good or bad luck when the agent tries different actions. However, over time the law of large numbers will win out, and the value estimates will converge close to their true values for the current policy, which makes policy improvement all but inevitable whenever improvement is possible (caveat: this near-inevitability only holds for the tabular Q-learning method, which can be proven to eventually converge on an optimal policy).
There is a proof that always taking the maximising action is a strict improvement to a policy when the estimates are accurate. It is called the Policy Improvement Theorem.
* To be clear, there is nothing particularly clever about taking the max here. It is literally what the target policy has been chosen to do, so the term $\text{max}_{a'}Q(S_{t+1},a')$ is just the same as $V_{\pi}(S_{t+1})$ - the value of being in state $S_{t+1}$ for the current target policy $\pi$.
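Putting the pieces together, here is a rough tabular Q-learning sketch for CartPole. It is an illustration, not a reference implementation: the discretisation bins, hyperparameters and helper names are arbitrary choices of mine, and it assumes the classic Gym API where `reset()` returns an observation and `step()` returns `(obs, reward, done, info)`; newer `gym`/`gymnasium` releases return extra values and would need small adjustments.

```python
import gym
import numpy as np

env = gym.make("CartPole-v1")
n_actions = env.action_space.n

# Crude discretisation of the continuous observation into a tuple of bin
# indices, so that a tabular Q "table" (here a dict) can be used.
BINS = [
    np.linspace(-2.4, 2.4, 9),    # cart position
    np.linspace(-3.0, 3.0, 9),    # cart velocity
    np.linspace(-0.21, 0.21, 9),  # pole angle (radians)
    np.linspace(-3.0, 3.0, 9),    # pole angular velocity
]

def discretise(obs):
    return tuple(int(np.digitize(x, bins)) for x, bins in zip(obs, BINS))

Q = {}  # maps (discrete state, action) -> estimated action value

def q_value(s, a):
    return Q.get((s, a), 0.0)

alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    s = discretise(env.reset())          # classic Gym API assumed
    done = False
    while not done:
        # Behaviour policy: epsilon-greedy around the current estimates.
        if np.random.random() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax([q_value(s, a_) for a_ in range(n_actions)]))

        obs, reward, done, _ = env.step(a)
        s_next = discretise(obs)

        # Target policy: greedy, hence the max over next-state estimates.
        future = 0.0 if done else max(q_value(s_next, a_) for a_ in range(n_actions))
        td_target = reward + gamma * future
        Q[(s, a)] = q_value(s, a) + alpha * (td_target - q_value(s, a))

        s = s_next
```

The `max` over next-state values is the target (greedy) policy, while the epsilon-greedy action selection is the behaviour policy - the off-policy split described in the list above.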