
I don't understand why we can't apply value iteration when we don't know the reward and transition probabilities. In this lecture, the lecturer says it has to do with not being able to take the max with samples, but what does this mean?

Why does Q-learning not need to know the reward and transition functions? Q-learning also has a max, so I don't understand the difference.

nbro
Abhishek Bhatia

1 Answer


For normal value iteration, you need to have the model, i.e. the transition probabilities, denoted by $P(s' \mid s,a)$, and the reward function: each backup computes an expectation over the possible next states $s'$, which requires knowing $P$.
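To make this concrete, here is a minimal value iteration sketch on a made-up 2-state, 2-action MDP (the arrays `P`, `R` and the discount `gamma` are illustrative assumptions, not from the question). Note that the backup literally indexes into `P` and `R`, which is why the model must be known:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP -- P, R and gamma are made up
# for illustration; value iteration needs them as explicit inputs.
n_states, n_actions = 2, 2
gamma = 0.9

# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: an expectation over s' weighted by
    # P(s' | s, a) -- impossible to compute without knowing P.
    Q = R + gamma * P @ V        # shape (n_states, n_actions)
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new
```

The fixed point satisfies the Bellman optimality equation, $V(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V(s') \right]$.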

With Q-learning, you use the current (sampled) reward and the already stored Q values. Given a single observed transition $(s, a, r, s')$, the update is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

The max here is taken over the stored Q values at the sampled next state $s'$, not over an expectation weighted by $P(s' \mid s, a)$, so samples alone are enough and no model is needed.

The relation between the value function $V(s)$ and the $Q$ function $Q(s, a)$ is that $V(s)$ is simply the value of the best action in $s$, that is, $V(s) = \max_a Q(s, a)$.
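The point above can be sketched in code. The toy chain environment below is a hypothetical stand-in (not from the question): the agent only ever sees sampled `(s, a, r, s')` tuples, never `P` or `R`, and `V` is recovered at the end as the max over stored Q values:

```python
import random
from collections import defaultdict

random.seed(0)

class ChainEnv:
    """Toy 3-state chain (illustrative): action 1 moves right,
    reward 1 on reaching the last state, which is terminal."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + 1, 2) if a == 1 else max(self.s - 1, 0)
        reward = 1.0 if self.s == 2 else 0.0
        return self.s, reward, self.s == 2

env = ChainEnv()
Q = defaultdict(float)           # Q[(s, a)], initialised to 0
alpha, gamma, eps = 0.5, 0.9, 0.1

for episode in range(500):
    s = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection over the stored Q values.
        a = random.choice([0, 1]) if random.random() < eps else \
            max([0, 1], key=lambda a_: Q[(s, a_)])
        s_next, r, done = env.step(a)
        # The max is over stored Q values at the *sampled* next
        # state, so no transition model is ever consulted.
        target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)]) * (not done)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

# V(s) = max_a Q(s, a)
V = {s: max(Q[(s, 0)], Q[(s, 1)]) for s in range(3)}
```

After enough episodes the table approaches the true values for this chain: $Q(1, 1) \approx 1$ and $Q(0, 1) \approx \gamma \cdot 1 = 0.9$, without the agent ever knowing the transition or reward functions.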

agold