Questions tagged [policy-iteration]

For questions related to the policy iteration (PI) algorithm, a dynamic programming algorithm used to solve a Markov decision process (MDP), given the MDP's transition model and reward function (i.e. its model).

35 questions
9 votes, 2 answers

Why are the Bellman operators contractions?

In these slides, it is written \begin{align} \left\|T^{\pi} V-T^{\pi} U\right\|_{\infty} & \leq \gamma\|V-U\|_{\infty} \tag{9} \label{9} \\ \|T V-T U\|_{\infty} & \leq \gamma\|V-U\|_{\infty} \tag{10} \label{10} \end{align} where $F$ is the space of…
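A sketch of the standard argument for the first inequality, assuming the usual definition $(T^{\pi} V)(s) = r(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V(s')$ for a deterministic policy $\pi$:
\begin{align}
\left|(T^{\pi} V)(s) - (T^{\pi} U)(s)\right| &= \gamma \left|\sum_{s'} P(s' \mid s, \pi(s))\big(V(s') - U(s')\big)\right| \\
&\leq \gamma \sum_{s'} P(s' \mid s, \pi(s)) \left|V(s') - U(s')\right| \leq \gamma \|V - U\|_{\infty}.
\end{align}
Taking the supremum over $s$ gives (9); (10) follows the same way, using $\left|\max_a f(a) - \max_a g(a)\right| \leq \max_a \left|f(a) - g(a)\right|$ to handle the max in the optimality operator.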
5 votes, 0 answers

What exactly is non-delusional Q-learning?

Problems occur when we combine Q-learning with a function approximator. What exactly are delusional bias and non-delusional Q-learning? I am talking about the NeurIPS 2018 best paper, Non-delusional Q-learning and value iteration. I have trouble…
5 votes, 2 answers

When to use Value Iteration vs. Policy Iteration

Both value iteration and policy iteration are generalized policy iteration (GPI) algorithms. However, they differ in the mechanics of their updates. Policy iteration first finds a complete value function for a policy, then derives the Q…
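As a point of contrast, here is a minimal sketch of the two update loops on a tabular MDP. The model representation is an assumption for illustration: `P[s][a]` is a list of `(prob, next_state)` pairs and `R[s][a]` is the expected immediate reward.

```python
import numpy as np

def q_value(V, s, a, P, R, gamma):
    # One-step lookahead: expected reward plus discounted next-state value.
    return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

def value_iteration(P, R, gamma, n_states, n_actions, tol=1e-8):
    # Repeatedly apply the Bellman optimality backup until the values settle.
    V = np.zeros(n_states)
    while True:
        V_new = np.array([max(q_value(V, s, a, P, R, gamma)
                              for a in range(n_actions))
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration(P, R, gamma, n_states, n_actions, tol=1e-8):
    pi = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: iterate the Bellman expectation backup to convergence.
        V = np.zeros(n_states)
        while True:
            V_new = np.array([q_value(V, s, pi[s], P, R, gamma)
                              for s in range(n_states)])
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Policy improvement: act greedily with respect to the evaluated V.
        pi_new = np.array([max(range(n_actions),
                               key=lambda a: q_value(V, s, a, P, R, gamma))
                           for s in range(n_states)])
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new
```

The structural difference is visible here: value iteration folds the max into every sweep, while policy iteration alternates a full evaluation with a separate greedy improvement step.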
5 votes, 1 answer

Does the policy iteration convergence hold for finite-horizon MDP?

Most RL books (Sutton & Barto, Bertsekas, etc.) talk about policy iteration for infinite-horizon MDPs. Does the policy iteration convergence hold for finite-horizon MDP? If yes, how can we derive the algorithm?
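For context: in the finite-horizon setting the optimal value function is time-indexed and is computed by backward induction rather than by iterating to a fixed point. With horizon $H$ and terminal condition $V_H \equiv 0$,
\begin{align}
V_t(s) = \max_{a} \Big[r(s, a) + \sum_{s'} P(s' \mid s, a)\, V_{t+1}(s')\Big], \qquad t = H-1, \dots, 0,
\end{align}
so a finite-horizon analogue of policy iteration has to operate on time-dependent policies $\pi_t$.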
5 votes, 2 answers

Why are policy iteration and value iteration studied as separate algorithms?

In Sutton and Barto's book about reinforcement learning, policy iteration and value iteration are presented as separate/different algorithms. This is confusing because policy iteration includes an update of the value function, and value iteration…
5 votes, 2 answers

How can the policy iteration algorithm be model-free if it uses the transition probabilities?

I'm trying to understand policy iteration in the context of RL. I read an article presenting it, and at some point pseudocode for the algorithm is given. What I can't understand is one particular line: from what I understand, policy…
5 votes, 1 answer

Why is my implementation of Q-learning not converging to the right values in the FrozenLake environment?

I am trying to learn tabular Q-learning using a table of states and actions (i.e. no neural networks). I was trying it out on the FrozenLake environment. It's a very simple environment, where the task is to reach the goal G starting from the source S…
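For reference, a minimal tabular Q-learning loop on FrozenLake, assuming the newer Gym step API (gym >= 0.26); the hyperparameters `alpha`, `gamma`, and `eps` are placeholder choices, and a fixed learning rate or exploration rate that never decays is a common reason the estimates fail to converge:

```python
import numpy as np
import gym

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.99, 0.1  # assumed hyperparameters

for episode in range(20000):
    s, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection.
        a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s2, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Q-learning update: bootstrap from the greedy value of the next
        # state, using a zero target value at terminal states.
        target = r + gamma * (0.0 if terminated else np.max(Q[s2]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2
```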
4 votes, 1 answer

Why doesn't value iteration use $\pi(a \mid s)$ while policy evaluation does?

I was looking at the Bellman equation, and I noticed a difference between the equations used in policy evaluation and value iteration. In policy evaluation, there was the presence of $\pi(a \mid s)$, which indicates the probability of choosing…
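Side by side, the difference being asked about is between the Bellman expectation backup used in policy evaluation, which averages over the policy's action distribution, and the Bellman optimality backup used in value iteration, which replaces that average with a max (in Sutton and Barto's notation):
\begin{align}
v_{\pi}(s) &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_{\pi}(s')\big] \\
v_{*}(s) &= \max_{a} \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_{*}(s')\big]
\end{align}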
4 votes, 1 answer

Why is update rule of the value function different in policy evaluation and policy iteration?

In the textbook "Reinforcement Learning: An Introduction", by Richard Sutton and Andrew Barto, the pseudo code for Policy Evaluation is given as follows: The update equation for $V(s)$ comes from the Bellman equation for $v_{\pi}(s)$ which is…
4 votes, 2 answers

Would you categorize policy iteration as an actor-critic reinforcement learning approach?

One way of understanding the difference between value function approaches, policy approaches and actor-critic approaches in reinforcement learning is the following: A critic explicitly models a value function for a policy. An actor explicitly…
4 votes, 1 answer

Understanding the update rule for the policy in the policy iteration algorithm

Consider the grid world problem in RL. Formally, a policy in RL is defined as $\pi(a \mid s)$. If we are solving grid world by policy iteration, then the following pseudocode is used. My question is related to the policy improvement step. Specifically, I…
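The improvement step in that pseudocode makes the policy greedy with respect to the current value function. A sketch under an assumed model representation, with `p[s][a]` a list of `(prob, next_state, reward)` triples:

```python
def policy_improvement(V, p, states, actions, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to V."""
    pi = {}
    for s in states:
        # Argmax over actions of the one-step lookahead value.
        pi[s] = max(actions,
                    key=lambda a: sum(prob * (r + gamma * V[s2])
                                      for prob, s2, r in p[s][a]))
    return pi
```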
3 votes, 1 answer

Are policy and value iteration used only in grid world like scenarios?

I am trying to teach myself reinforcement learning. At the moment I am focusing on policy and value iteration, and I am running into several problems and doubts. One of the main doubts stems from the fact that I can't find many diversified examples of how…
3 votes, 0 answers

What are some strong algorithms for Perfect Information, Deterministic Multiplayer Games?

I have a series of games with the following properties: 3 or more players, but purely non-cooperative (i.e., no coalition formation); sequential moves; perfect information; deterministic state transitions and rewards; and the game size is large enough to…
3 votes, 1 answer

What is generalized policy iteration?

I am reading Sutton and Barto's material now. I know value iteration, an iterative algorithm that backs up the maximum value over successor states, and policy iteration. But what is generalized policy iteration?
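In Sutton and Barto's usage, generalized policy iteration is the umbrella pattern of letting an evaluation process and an improvement process interact, with no requirement that either runs to completion. Schematically (a sketch with `evaluate` and `improve` as placeholders):

```python
def generalized_policy_iteration(pi, V, evaluate, improve, n_steps):
    # Neither step needs to run to convergence: one sweep of evaluation
    # and a greedy (or epsilon-greedy) improvement already fit the pattern.
    for _ in range(n_steps):
        V = evaluate(pi, V)
        pi = improve(V)
    return pi, V
```

Policy iteration (full evaluation, then greedy improvement) and value iteration (a single evaluation sweep fused with the max) are the two extremes of this pattern.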
3 votes, 2 answers

Why do value iteration and policy iteration obtain similar policies even though they have different value functions?

I am trying to implement value and policy iteration algorithms. My value function from policy iteration looks vastly different from the values from value iteration, but the policy obtained from both is very similar. How is this possible? And what…
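Part of the usual answer: the greedy policy depends only on which action attains the argmax, so value functions can differ noticeably while inducing the same actions. For example, shifting a value function by a constant $c$ shifts every action's one-step lookahead value by the same amount $\gamma c$:
\begin{align}
\arg\max_{a} \Big[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\,(V(s') + c)\Big] = \arg\max_{a} \Big[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\,V(s')\Big],
\end{align}
and a similar insensitivity explains why values produced by stopping value iteration early can still yield the optimal policy.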