
I don't understand why we can't apply value iteration when we don't know the reward and transition probabilities. In this lecture, the lecturer says it has to do with not being able to take the max with samples, but what does this mean?

Why does Q-learning not need to know the reward and transition functions? Q-learning also has a max, so I don't understand the difference.

nbro
Abhishek Bhatia

1 Answer


For normal value iteration, you need to have the model, i.e. the transition probabilities, denoted by $P(s' \mid s,a)$, and the reward function: each backup computes an expectation over the possible next states $s'$, which requires knowing $P$.
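To make this concrete, here is a minimal value iteration sketch on a made-up 2-state, 2-action MDP (the arrays `P`, `R` and the discount `gamma` are illustrative assumptions, not from the question). Note that the backup literally indexes into `P` and `R`, which is why the model must be known:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP -- P, R and gamma are made up
# for illustration; value iteration needs them as explicit inputs.
n_states, n_actions = 2, 2
gamma = 0.9

# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: an expectation over s' weighted by
    # P(s' | s, a) -- impossible to compute without knowing P.
    Q = R + gamma * P @ V        # shape (n_states, n_actions)
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new
```

The fixed point satisfies the Bellman optimality equation, $V(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V(s') \right]$.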

With Q-learning, you use the current (sampled) reward and the already stored Q values. Given a single observed transition $(s, a, r, s')$, the update is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

The max here is taken over the stored Q values at the sampled next state $s'$, not over an expectation weighted by $P(s' \mid s, a)$, so samples alone are enough and no model is needed.

The relation between the value function $V(s)$ and the $Q$ function $Q(s, a)$ is that $V(s)$ is simply the value of the best action in $s$, that is, $V(s) = \max_a Q(s, a)$.
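The point above can be sketched in code. The toy chain environment below is a hypothetical stand-in (not from the question): the agent only ever sees sampled `(s, a, r, s')` tuples, never `P` or `R`, and `V` is recovered at the end as the max over stored Q values:

```python
import random
from collections import defaultdict

random.seed(0)

class ChainEnv:
    """Toy 3-state chain (illustrative): action 1 moves right,
    reward 1 on reaching the last state, which is terminal."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + 1, 2) if a == 1 else max(self.s - 1, 0)
        reward = 1.0 if self.s == 2 else 0.0
        return self.s, reward, self.s == 2

env = ChainEnv()
Q = defaultdict(float)           # Q[(s, a)], initialised to 0
alpha, gamma, eps = 0.5, 0.9, 0.1

for episode in range(500):
    s = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection over the stored Q values.
        a = random.choice([0, 1]) if random.random() < eps else \
            max([0, 1], key=lambda a_: Q[(s, a_)])
        s_next, r, done = env.step(a)
        # The max is over stored Q values at the *sampled* next
        # state, so no transition model is ever consulted.
        target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)]) * (not done)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

# V(s) = max_a Q(s, a)
V = {s: max(Q[(s, 0)], Q[(s, 1)]) for s in range(3)}
```

After enough episodes the table approaches the true values for this chain: $Q(1, 1) \approx 1$ and $Q(0, 1) \approx \gamma \cdot 1 = 0.9$, without the agent ever knowing the transition or reward functions.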

agold