
Slide 27/66 states that "This is the maximum value achievable under any policy". My understanding of value-based RL is that the optimal Q-table/function is learnt, and that the implicit optimal policy can then be "derived" by greedily picking actions according to the learnt Q-values.

Does the implicit policy not matter here?

mathnoob
  • Probably related: https://ai.stackexchange.com/questions/40540/how-do-we-get-the-optimal-value-function – Luca Anzalone May 30 '23 at 15:07
  • I have no idea what you're asking here. The implicit policy, which you don't define, is supposed to matter to what? Reformulate the question in a more specific and clearer way, please. – nbro May 31 '23 at 22:42

2 Answers


Slide 22 in your link says:

  • Once we have $Q^*$ we can act optimally: $$\pi^*(s) = \arg \max_a Q^*(s,a)$$

That is the optimal policy, which is greedy with respect to $Q^*$.
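
As a minimal sketch (assuming a tabular setting where the learnt $Q^*$ is stored as a NumPy array of shape `[n_states, n_actions]`; the values below are made up), extracting that greedy policy is a single `argmax` per state:

```python
import numpy as np

# Hypothetical learnt Q*-table: rows are states, columns are actions.
Q = np.array([
    [1.0, 3.0],   # state 0: action 1 has the highest value
    [2.5, 0.5],   # state 1: action 0 has the highest value
])

# Implicit greedy policy: pi*(s) = argmax_a Q*(s, a)
pi_star = Q.argmax(axis=1)   # -> array([1, 0])

def act(state):
    """Act optimally by following the policy derived from Q*."""
    return pi_star[state]
```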

Kostya
  • Yeah, but that's where my confusion with the statement in slide 27 comes from. It then says "... under any policy". Does this mean this derived optimal policy is irrelevant? Does "any policy" here mean that we don't need the optimal policy to obtain the expected maximum value? – mathnoob May 31 '23 at 10:57

This is the maximum value achievable under any policy

The statement says that $Q^*(s,a)$ is the maximum state-action value achievable by any policy $\pi$: no policy, from a random policy to the optimal one, can exceed it. This is also explained in slide 23:

  • "An optimal value function is the maximum achievable value"
  • "Once you have $Q^*$ we can act optimally"
  • "Optimal value maximizes over all decisions"

Indeed, the optimal policy $\pi^*$ always achieves $Q^*$. This does not mean that the implicit policy (the one derived from it, $\pi^*(s)=\arg\max_a Q^*(s,a)$) is unimportant, nor that it cannot be computed from $Q^*$.

You have to think about this in terms of policy evaluation (or prediction, in classical RL terminology): given the optimal action-value table, you can compute the value (or return) that a policy achieves, and you can derive the optimal policy by taking, in each state, the action that maximizes $Q^*$.
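
To make the "under any policy" point concrete, here is a small sketch (with a made-up $Q^*$-table for a two-state, two-action problem). For any action distribution $\pi(a \mid s)$, the expectation $\sum_a \pi(a \mid s)\, Q^*(s,a)$ can never exceed $V^*(s) = \max_a Q^*(s,a)$, and the greedy policy attains it exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical optimal Q-table: 2 states x 2 actions.
Q_star = np.array([
    [1.0, 3.0],
    [2.5, 0.5],
])

V_star = Q_star.max(axis=1)  # V*(s) = max_a Q*(s, a)

# Sample many random stochastic policies pi(a|s) and check that the
# expected optimal action value never exceeds V*(s) in any state.
for _ in range(1000):
    logits = rng.normal(size=Q_star.shape)
    pi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    assert np.all((pi * Q_star).sum(axis=1) <= V_star + 1e-12)

# The greedy policy (one-hot on the argmax action) attains V* exactly.
pi_greedy = np.eye(Q_star.shape[1])[Q_star.argmax(axis=1)]
assert np.allclose((pi_greedy * Q_star).sum(axis=1), V_star)
```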

Luca Anzalone