
I'm trying to understand policy iteration in the context of RL. I read an article presenting it and, at some point, pseudocode of the algorithm is given:

[image: pseudocode of the policy iteration algorithm]

What I can't understand is this line:

$$V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s)) \left[ r + \gamma V(s') \right]$$

From what I understand, policy iteration is a model-free algorithm, which means that it doesn't need to know the environment's dynamics. But, in this line, we need $p(s',r \mid s, \pi(s))$ (which, in my understanding, is the transition function of the MDP, giving the probability of landing in state $s'$ with reward $r$, given the previous state $s$ and the action taken) to compute $V(s)$. So I don't understand how we can compute $V(s)$ with the quantity $p(s',r \mid s, \pi(s))$, since it is a parameter of the environment.
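
To make that line concrete, here is a minimal sketch of how I read the update (Python, with a hypothetical tabular model `p[s][a]` given as a list of `(prob, next_state, reward)` triples); it shows that the backup explicitly sums over the environment's transition probabilities:

```python
# Minimal sketch of the iterative policy-evaluation step, assuming the
# environment's dynamics are available as a tabular model.
# p[s][a] is a hypothetical structure: a list of (prob, next_state, reward)
# triples representing p(s', r | s, a).

def evaluate_policy(p, policy, states, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            a = policy[s]  # deterministic policy: a = pi(s)
            # This backup needs p(s', r | s, a), i.e. the model of the MDP.
            V[s] = sum(prob * (r + gamma * V[s_next])
                       for prob, s_next, r in p[s][a])
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V
```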


2 Answers


Everything you say in your post is correct, apart from the wrong assumption that policy iteration is model-free. Policy iteration is a model-based algorithm for exactly the reasons you mention.

See my answer to the question "What's the difference between model-free and model-based reinforcement learning?".

– nbro

The Policy Iteration algorithm (given in the question) is model-based.

However, note that there exist model-free methods, such as SARSA, that fall into the Generalized Policy Iteration category.
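
As a rough illustration (a minimal sketch, not taken from the article in the question), the SARSA update only needs a sampled transition from interacting with the environment, so it never touches $p(s', r \mid s, a)$:

```python
# Minimal sketch of the SARSA update: it uses only a single sampled
# transition (s, a, r, s', a') and never needs the model p(s', r | s, a).
# Q is a dict mapping (state, action) pairs to value estimates; it is
# updated in place. alpha is the learning rate, gamma the discount factor.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```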

> From what I understand, policy iteration is a model-free algorithm

Maybe this was referring to generalized policy iteration methods.


(Answer based on comments from @Neil Slater.)

– dasWesen