
In a Q-table you keep values for states and actions used in ongoing decision making; in some sense it represents knowledge of the world and of your future decisions for this and any future instance of a game. On the other hand, Q-learning is considered model-free.

Is that contradictory, or does the Q-table not fully represent a model of the game?

  • Does this answer your question? [What's the difference between model-free and model-based reinforcement learning?](https://ai.stackexchange.com/questions/4456/whats-the-difference-between-model-free-and-model-based-reinforcement-learning) – Neil Slater Aug 04 '23 at 18:03
  • I guess the answer is yes, but the problem is I don't see how XD – Raul Lapeira Herrero Aug 04 '23 at 18:15
  • Ok, is there any way you could explain what the gap is - what doesn't that other question and answer tell you, that means you are still unsure? – Neil Slater Aug 04 '23 at 18:21

3 Answers


The Q table is a useful summary of the underlying Markov Decision Process (MDP) model that describes the environment and the available choices. A Q table summarises expected returns for a single policy - in Q learning this is self-referential, in that the policy it predicts for is the one where the agent greedily selects the action with the best predicted long-term return.

As a summary, the Q table is compressed in a way that is irreversible. It is not possible to derive the full MDP description of the environment from it.

The Q table is still a useful tool, because it can be learned from experience even if you don't know the MDP model. This is what model-free methods do.
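
For a concrete picture, the standard tabular Q-learning update only needs a single experienced transition $(s, a, r, s')$ - the transition probabilities $p(s' \mid s, a)$ never appear in it:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Everything on the right-hand side is either a sampled quantity or an entry already in the table, which is why the table can be learned purely from experience.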

In a general common-language sense, the Q table is also a model, in that it can be used to make predictions (if the agent acts a certain way, it should expect to receive a certain amount of reward). However, the MDP model is the fully descriptive model for sequential decision problems (it predicts all individual outcomes, independently of any policy), so in RL, referring to "the model" almost always means the MDP.

Neil Slater

No, a model in RL is something (you can even consider it a black box) that knows how to transition from one state to the next. For example, if we are playing Snake and I give you the current grid and the action for where to move the head, an optimal model will have learnt the next game board, with the snake moved in the direction of the action.

Usually you can also learn the reward function alongside the model, but the model by itself tells you the next state you will be in.

Now, don't confuse it with the environment: the model learns to imitate the environment, but it is not the environment itself.

After all of this, as you can see, $Q$-learning is not able to do that: it only tells you, for each state and action, what the expected return is, but not "if you are in state $s_t$ and take action $a_t$, you will end up in state $s_{t+1}$".
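
To make the contrast concrete, here is a minimal sketch on a made-up 1-D corridor (the environment, its state encoding and the action set are assumptions purely for illustration):

```python
# Toy 1-D corridor: states 0..4, actions -1 (left) and +1 (right).

def model(state, action):
    """A transition model answers: what state comes next?"""
    return min(max(state + action, 0), 4)

# A Q-table only answers: how good is this action here, in expected return?
Q = {(s, a): 0.0 for s in range(5) for a in (-1, +1)}

s, a = 2, +1
print(model(s, a))   # 3   -> a prediction of the next state (what a model gives you)
print(Q[(s, a)])     # 0.0 -> an estimate of the return (what Q-learning gives you)
```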

Alberto

The key is that Q-learning, in its standard formulation, does not use $p(s' \mid s, a)$ - the transition model. The Q-table isn't this model. It's not even an estimate of this model, because it doesn't give you any information about how to transition between states. The Q-table gives you an estimate of the expected return, i.e. the sum of rewards. So you can predict the return, and Q-learning uses the estimated return in its update rule, but you cannot predict the next states, unless you augment Q-learning with that, which people don't usually do in the standard versions.

Note: you could estimate $p(s' \mid s, a)$ in Q-learning by keeping track of how many times you ended up in $s'$ after taking action $a$ in $s$.
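
A rough sketch of that counting idea (the data structures and the example transitions are just made up for illustration):

```python
from collections import defaultdict

# counts[(s, a)][s_next] = how many times s_next followed taking action a in state s
counts = defaultdict(lambda: defaultdict(int))

def record_transition(s, a, s_next):
    counts[(s, a)][s_next] += 1

def estimated_p(s_next, s, a):
    """Empirical estimate of p(s' | s, a) from the recorded transitions."""
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total else 0.0

# Example: two observed transitions from (s=0, a=1), one to state 1 and one to state 2
record_transition(0, 1, 1)
record_transition(0, 1, 2)
print(estimated_p(1, 0, 1))   # 0.5
```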

I think the confusion is very understandable. There are many models: the Q-table is indeed a model too, but it's a model of the Q-function. If it were up to me, I'd abolish this horrible terminology - the worst part of RL. The on-policy vs off-policy distinction is even worse. Literally, it's easier to understand the 10 lines of code of Q-learning than the terms off-policy and model-free, which are just fancy names that pump it up, but you will eventually get used to them.
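
For reference, here is roughly that handful of lines, run on a tiny made-up chain environment (the environment, hyperparameters and episode count are assumptions for illustration; the one-line update inside the loop is the actual Q-learning rule):

```python
import random

N = 5                                      # chain of states 0..4; reward 1 for reaching state 4
ACTIONS = (-1, +1)                         # move left / move right

def step(s, a):                            # toy environment used only for this sketch
    s2 = min(max(s + a, 0), N - 1)
    return s2, float(s2 == N - 1), s2 == N - 1    # next state, reward, done

Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.2

for _ in range(300):                       # episodes
    s, done = random.randrange(N - 1), False
    while not done:
        # epsilon-greedy action selection from the current Q-table
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        # the "model-free" part: only the sampled transition (s, a, r, s2) is used, never p(s'|s,a)
        Q[(s, a)] += alpha * (r + gamma * (0.0 if done else max(Q[(s2, x)] for x in ACTIONS)) - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda x: Q[(0, x)]))   # greedy action at state 0, typically +1 after training
```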

nbro
  • The following is exactly my POV on this: "it's easier to understand the 10 lines of code of Q-learning than the terms off-policy and model-free, which are just fancy names that pump it up, but you will eventually get used to them" – Raul Lapeira Herrero Aug 05 '23 at 14:06
  • Your explanation is possibly better than the one I am choosing, but I understood it better with the layman's terms - thanks for the great detail – Raul Lapeira Herrero Aug 05 '23 at 16:49