Questions tagged [value-iteration]

For questions related to the value iteration algorithm, a dynamic programming (DP) algorithm used to solve a Markov decision process (MDP), i.e. to find an optimal policy given the transition and reward functions of the MDP. Value iteration is closely related to another DP algorithm called policy iteration.

For more info, see e.g. http://www.incompleteideas.net/book/first/ebook/node44.html.
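As a quick orientation for the tag, below is a minimal sketch of tabular value iteration in Python. It assumes a model given as `P[s][a] = [(prob, next_state, reward), ...]` (a hypothetical layout, loosely modelled on what Gym's FrozenLake exposes); it is an illustration, not a reference implementation.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Tabular value iteration over an explicit MDP model.

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples
    describing the transition and reward functions.
    """
    V = np.zeros(n_states)  # any initial values work (see the first question below)
    while True:
        V_new = np.empty_like(V)
        for s in range(n_states):
            # Bellman optimality backup: V(s) <- max_a sum_{s'} p(s'|s,a) [r + gamma V(s')]
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < tol:
            break
    # Extract a greedy policy from the (near-)optimal value function
    policy = [
        max(range(n_actions),
            key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
        for s in range(n_states)
    ]
    return V, policy
```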

46 questions
7 votes
2 answers

In Value Iteration, why can we initialize the value function arbitrarily?

I have not been able to find a good explanation of this, other than statements that the algorithm is guaranteed to converge with arbitrary choices for initial values in each state. Is this something to do with the Bellman optimality constraint…
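For context, the standard textbook argument (not part of the excerpt) is that the Bellman optimality operator $T$, defined by $(Tv)(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v(s')]$, is a $\gamma$-contraction in the max norm, so iterating it from any starting point $v_0$ converges to the unique fixed point $v_*$ when $0 \le \gamma < 1$:
$$\| T v - T v' \|_\infty \le \gamma \, \| v - v' \|_\infty
\quad\Longrightarrow\quad
\| T^k v_0 - v_* \|_\infty \le \gamma^k \, \| v_0 - v_* \|_\infty \to 0 .$$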
5 votes
1 answer

Should the reward or the Q-value be clipped for reinforcement learning?

When extending reinforcement learning to the continuous-state, continuous-action case, we must use function approximators (linear or non-linear) to approximate the Q-value. It is well known that non-linear function approximators, such as neural…
5 votes
1 answer

How is the fitted Q-iteration algorithm related to $Q^*(s, a)$, and how can we use function approximation with this algorithm?

I hope to get some clarifications on Fitted Q-Iteration (FQI). My Research So Far I've read Sutton's book (specifically, chapters 6 to 10), Ernst et al., and this paper. I know that $Q^*(s, a)$ expresses the expected value of first taking action $a$ from…
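As a rough illustration of the FQI idea (regress $Q$ on bootstrapped targets built from a fixed batch of transitions), here is a hedged sketch; the array layout and the choice of `ExtraTreesRegressor` are assumptions made for this example, not a claim about the cited papers' exact setup.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(S, A, R, S_next, n_actions, gamma=0.99, n_iters=50):
    """Batch FQI sketch: S is (N, d), A is (N,), R is (N,), S_next is (N, d)."""
    X = np.hstack([S, A.reshape(-1, 1)])  # regress Q on (state, action) pairs
    q_model = None
    for _ in range(n_iters):
        if q_model is None:
            y = R  # first iteration: Q_1 is approximately the immediate reward
        else:
            # Bootstrapped target: r + gamma * max_a' Q_k(s', a')
            q_next = np.column_stack([
                q_model.predict(np.hstack([S_next, np.full((len(S_next), 1), a)]))
                for a in range(n_actions)
            ])
            y = R + gamma * q_next.max(axis=1)
        q_model = ExtraTreesRegressor(n_estimators=50).fit(X, y)  # supervised refit
    return q_model
```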
5 votes
0 answers

What exactly is non-delusional Q-learning?

Problems occur when we combine Q-learning with a function approximator. What exactly are the delusional bias and non-delusional Q-learning? I am talking about the NeurIPS 2018 best paper Non-delusional Q-learning and value-iteration. I have trouble…
5 votes
2 answers

Why are policy iteration and value iteration studied as separate algorithms?

In Sutton and Barto's book about reinforcement learning, policy iteration and value iteration are presented as separate/different algorithms. This is very confusing because policy iteration includes an update/change of value and value iteration…
5 votes
1 answer

Why is my implementation of Q-learning not converging to the right values in the FrozenLake environment?

I am trying to learn tabular Q-learning by using a table of states and actions (i.e. no neural networks). I was trying it out on the FrozenLake environment. It's a very simple environment, where the task is to reach the goal G starting from a source S…
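For comparison when debugging (a minimal sketch, not the asker's code), here is the core tabular Q-learning update, including the terminal-state handling that is a common source of non-convergence on FrozenLake:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update on an (n_states, n_actions) array Q."""
    # Bootstrap only from non-terminal successors; continuing to bootstrap
    # after the episode ends is a frequent FrozenLake pitfall.
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```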
4 votes
1 answer

What should the discount factor for the non-slippery version of the FrozenLake environment be?

I was working with FrozenLake 4x4 from OpenAI Gym. In the slippery case, using a discount factor of 1, my value iteration implementation was giving a success rate of around 75 percent. It was much worse for the 8x8 grid, with success around 50%…
4 votes
1 answer

Why doesn't value iteration use $\pi(a \mid s)$ while policy evaluation does?

I was looking at the Bellman equation, and I noticed a difference between the equations used in policy evaluation and value iteration. In policy evaluation, there was the presence of $\pi(a \mid s)$, which indicates the probability of choosing…
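For reference, the two updates from Sutton & Barto side by side: policy evaluation averages over the policy's action probabilities, while value iteration takes the maximum over actions.
$$\text{Policy evaluation:}\quad
v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_k(s')\bigr]$$
$$\text{Value iteration:}\quad
v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_k(s')\bigr]$$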
4 votes
1 answer

Why can't we apply value iteration when we do not know the reward and transition functions, and how does Q-learning solve this issue?

I don't understand why we can't apply value iteration when we don't know the reward and transition probabilities. In this lecture, the lecturer says it has to do with not being able to take the max with samples, but what does this mean? Why does Q-learning…
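One way to see the issue the excerpt raises: the value iteration backup needs the model ($p(s' \mid s, a)$ and the reward function) explicitly, whereas the Q-learning update only uses a sampled transition $(s, a, r, s')$.
$$\text{Value iteration (model-based):}\quad
V(s) \leftarrow \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[r(s, a, s') + \gamma V(s')\bigr]$$
$$\text{Q-learning (sample-based):}\quad
Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\bigr]$$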
4 votes
2 answers

Would you categorize policy iteration as an actor-critic reinforcement learning approach?

One way of understanding the difference between value function approaches, policy approaches and actor-critic approaches in reinforcement learning is the following: A critic explicitly models a value function for a policy. An actor explicitly…
4 votes
1 answer

Understanding the update rule for the policy in the policy iteration algorithm

Consider the grid world problem in RL. Formally, a policy in RL is defined as $\pi(a|s)$. If we are solving the grid world by policy iteration, then the following pseudocode is used: My question is related to the policy improvement step. Specifically, I…
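For reference (the standard improvement step from the usual pseudocode, stated here since the pseudocode image is not reproduced in the excerpt): the improved policy acts greedily with respect to the value function of the current policy,
$$\pi'(s) \doteq \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v_{\pi}(s')\bigr].$$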
4 votes
1 answer

A few questions regarding the difference between policy iteration and value iteration

The question already has some answers, but I am still finding it quite unclear (also, does $\pi(s)$ here mean $q(s,a)$?). The few things I do not understand are: why the difference between the 2 iterations if we are acting greedily in each of them? As…
3 votes
1 answer

What is the time complexity of the value iteration algorithm?

Recently, I have come across the information (lectures 8 and 9 on MDPs of this UC Berkeley AI course) that the time complexity for each iteration of the value iteration algorithm is $\mathcal{O}(|S|^{2}|A|)$, where $|S|$ is the number of states…
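The $\mathcal{O}(|S|^{2}|A|)$ per-iteration cost can be read off the loop structure of a single sweep over a dense model. A sketch, assuming hypothetical dense arrays `P[s, a, s2]` (transition probabilities) and `R[s, a, s2]` (rewards):

```python
import numpy as np

def one_sweep(V, P, R, gamma=0.9):
    """One value iteration sweep; the three nested loops give O(|S|^2 |A|)."""
    n_states, n_actions, _ = P.shape
    V_new = np.empty(n_states)
    for s in range(n_states):                      # |S| states
        best = -np.inf
        for a in range(n_actions):                 # |A| actions
            q_sa = sum(P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
                       for s2 in range(n_states))  # |S| successor states
            best = max(best, q_sa)
        V_new[s] = best
    return V_new
```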
3 votes
1 answer

Are policy and value iteration used only in grid world like scenarios?

I am trying to teach myself reinforcement learning. At the moment I am focusing on policy and value iteration, and I am running into several problems and doubts. One of the main doubts is that I can't find many diversified examples of how…
3 votes
2 answers

What is the value of a state when there is a certain probability that the agent will die after each step?

We assume an infinite horizon and discount factor $\gamma = 1$. At each step, after the agent takes an action and gets its reward, there is a probability $\alpha = 0.2$ that the agent will die. The assumed maze looks like this. Possible actions are go…
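A common way to reason about such setups (noted here as a standard observation, since the excerpt is cut off): a per-step survival probability of $1 - \alpha$ acts like an extra discount factor on future rewards, so with $\gamma = 1$ and $\alpha = 0.2$ the expected return behaves as if discounted by
$$\gamma_{\text{eff}} = \gamma\,(1 - \alpha) = 1 \times 0.8 = 0.8 .$$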