For questions related to policy evaluation (PE) algorithms: iterative numerical algorithms used to find the value function associated with a given policy, a task often referred to as the "prediction problem". Iterative policy evaluation is also a dynamic programming method that is regularly discussed in reinforcement learning textbooks.
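For concreteness, below is a minimal sketch of iterative policy evaluation on a tabular MDP. The dense-array layout for `P`, `R`, and `policy` is an assumption made for illustration, not a fixed convention.

```python
import numpy as np

def iterative_policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation for a tabular MDP (a minimal sketch).

    Assumed array layout (for illustration only):
      P[a, s, s'] : transition probabilities, shape (num_actions, num_states, num_states)
      R[s, a]     : expected immediate reward for taking action a in state s
      policy[s, a]: probability of taking action a in state s
    """
    num_states, num_actions = policy.shape
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            v_old = V[s]
            # Bellman expectation backup: average over actions and next states
            V[s] = sum(
                policy[s, a] * (R[s, a] + gamma * P[a, s] @ V)
                for a in range(num_actions)
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:  # stop when the largest change in a sweep is tiny
            break
    return V
```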
Questions tagged [policy-evaluation]
14 questions
20
votes
2 answers
What is the difference between First-Visit Monte-Carlo and Every-Visit Monte-Carlo Policy Evaluation?
I came across these two algorithms, but I cannot understand the difference between them, both in terms of implementation and intuition.
So, what difference does the second point in both slides refer to?
user9947
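As a rough illustration of the distinction (this is a sketch, not the pseudocode from the slides referenced in the question), the two estimators differ only in which occurrences of a state within an episode contribute a return. The episode format below, a list of (state, reward) pairs, is an assumption.

```python
from collections import defaultdict

def mc_update(episode, V, returns_count, gamma=1.0, first_visit=True):
    """Accumulate Monte Carlo returns for one episode (a rough sketch).

    episode: list of (state, reward) pairs in the order they were visited.
    V: running mean of returns per state (e.g. defaultdict(float)).
    returns_count: number of returns averaged per state (e.g. defaultdict(int)).
    """
    # Compute the return G_t following each time step, working backwards.
    G = 0.0
    updates = []
    for state, reward in reversed(episode):
        G = gamma * G + reward
        updates.append((state, G))
    updates.reverse()  # back to chronological order

    seen = set()
    for state, G in updates:
        if first_visit:
            if state in seen:
                continue  # first-visit: skip later occurrences within the episode
            seen.add(state)
        # every-visit: every occurrence of the state contributes its return
        returns_count[state] += 1
        V[state] += (G - V[state]) / returns_count[state]  # incremental mean
```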
8
votes
2 answers
What is the proof that policy evaluation converges to the optimal solution?
Although I know how the iterative policy evaluation algorithm based on dynamic programming works, I am having a hard time seeing why it actually converges.
Intuitively, it seems that, with each iteration, we get a better and better…

SAGALPREET SINGH
- 147
- 1
- 6
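For reference, the usual convergence argument (sketched here, not the full proof the question asks for) is that the Bellman expectation operator $T^{\pi}$ is a $\gamma$-contraction in the sup norm, so for $\gamma < 1$ the iterates of policy evaluation converge to its unique fixed point $v_{\pi}$ by the Banach fixed-point theorem:
$$(T^{\pi} v)(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v(s') \,\bigr], \qquad \lVert T^{\pi} u - T^{\pi} v \rVert_{\infty} \le \gamma\, \lVert u - v \rVert_{\infty}.$$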
4
votes
1 answer
Why is the update rule of the value function different in policy evaluation and policy iteration?
In the textbook "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto, the pseudocode for policy evaluation is given as follows:
The update equation for $V(s)$ comes from the Bellman equation for $v_{\pi}(s)$, which is…

Nishanth Rao
- 147
- 6
4
votes
3 answers
Why can the Bellman equation be turned into an update rule?
In chapter 4.1 of Sutton's book, the Bellman equation is turned into an update rule by simply changing its indices. How is that mathematically justified? I didn't quite get the intuition for why we are allowed to do that.
$$v_{\pi}(s) = \mathbb…

Saeid Ghafouri
- 113
- 5
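For context, the update rule the question refers to takes the Bellman equation for $v_{\pi}$ and indexes the two sides by successive iterates, turning the fixed-point condition into a successive-approximation sweep:
$$v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_{k}(s') \,\bigr], \qquad k = 0, 1, 2, \ldots$$
The justification is that $v_{\pi}$ is the unique fixed point of this mapping (for $\gamma < 1$), so the sequence $v_k$ converges to it regardless of the initial $v_0$.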
4
votes
1 answer
How does policy evaluation work for continuous state space model-free approaches?
How does policy evaluation work for continuous state space model-free approaches?
Theoretically, in a model-based setting with discrete state and action spaces, the value function can be computed via dynamic programming by solving the Bellman equation.
Let's say you…

calveeen
- 1,251
- 7
- 17
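One common model-free answer for continuous state spaces is semi-gradient TD(0) with function approximation. The sketch below uses linear features; the `env`, `policy`, and `features` interfaces are hypothetical placeholders, not a real library API.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, num_features,
                      episodes=1000, alpha=0.01, gamma=0.99):
    """Model-free policy evaluation on a continuous state space via
    semi-gradient TD(0) with linear function approximation (a sketch).

    Assumed (hypothetical) interfaces:
      features(s) -> np.ndarray of length num_features
      policy(s)   -> action
      env.reset() -> s ;  env.step(a) -> (s_next, reward, done)
    """
    w = np.zeros(num_features)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            x = features(s)
            # TD(0) target bootstraps from the current estimate w . x(s')
            target = r if done else r + gamma * (features(s_next) @ w)
            w += alpha * (target - x @ w) * x
            s = s_next
    return w  # estimated value: v_hat(s) ~= features(s) @ w
```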
2
votes
1 answer
Is the existence and uniqueness of the state-value function for $\gamma < 1$ theoretical?
Consider the following statement from section 4.1 (Policy Evaluation) of the first edition of Sutton and Barto's book.
The existence and uniqueness of $V^{\pi}$ are guaranteed as long as
either $\gamma < 1$ or eventual termination is guaranteed from…

hanugm
- 3,571
- 3
- 18
- 50
2
votes
1 answer
Can we use Q-learning update for policy evaluation (not control)?
For policy evaluation purposes, can we use the Q-learning algorithm even though, technically, it is meant for control?
Maybe like this:
Use the policy to be evaluated as the behaviour policy.
Update the Q value conventionally (i.e. updating…

Dhruv Mullick
- 123
- 4
2
votes
1 answer
Why do we need to go back to policy evaluation after policy improvement if the policy is not stable?
Above is the algorithm for Policy Iteration from Sutton's RL book. So, step 2 actually looks like value iteration, and then, at step 3 (policy improvement), if the policy isn't stable, it goes back to step 2.
I don't really understand this: it seems…

user8714896
- 717
- 1
- 4
- 21
2
votes
1 answer
Is value iteration stopped after one update of each state?
In section 4.4 Value Iteration, the authors write
One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration.
After that, they provide the following…

Alex
- 23
- 3
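As a companion sketch to the quote above (same assumed array layout as the policy evaluation sketch near the top of this page, which is an assumption for illustration), value iteration folds the one-sweep policy evaluation and the greedy improvement into a single max-backup:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Value iteration as truncated policy evaluation (a sketch).

    Assumed array layout:
      P[a, s, s'] : transition probabilities
      R[s, a]     : expected immediate reward
    """
    num_actions, num_states, _ = P.shape
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            v_old = V[s]
            # max over actions replaces the expectation over a fixed policy
            V[s] = max(R[s, a] + gamma * P[a, s] @ V for a in range(num_actions))
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    return V
```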
2
votes
1 answer
How can I implement policy evaluation when reward is tied to an action outcome?
I'm following the Stanford reinforcement learning videos on YouTube. One of the assignments asks us to write code for policy evaluation for Gym's FrozenLake-v0 environment.
In the course (and books I have seen), they define policy evaluation…

Argod
- 23
- 2
2
votes
0 answers
Difficulty understanding Monte Carlo policy evaluation (state-value) for gridworld
I've been trying to read chapter 5.1 of the Sutton & Barto book, but I'm still a bit confused about the procedure for Monte Carlo policy evaluation (p. 92), and now I just can't proceed with coding a Python solution, because I feel like I don't fully…

Late347
- 59
- 4
1
vote
1 answer
Why is the update in-place faster than the out-of-place one in dynamic programming?
In Barto and Sutton's book, it's written that we have two types of updates in dynamic programming:
- out-of-place updates
- in-place updates
The in-place update is said to be the faster one. Why is that the case?
This is the pseudocode that I used to test it.
if…

VanasisB
- 13
- 3
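As a rough illustration of the two sweep variants the question above contrasts (a sketch, not the asker's pseudocode), where `backup` is a hypothetical per-state Bellman backup function:

```python
import numpy as np

def sweep_out_of_place(V, backup):
    """One sweep that reads only the values from the previous sweep."""
    V_new = np.empty_like(V)
    for s in range(len(V)):
        V_new[s] = backup(s, V)   # every backup sees the old array
    return V_new

def sweep_in_place(V, backup):
    """One sweep that overwrites values as it goes."""
    for s in range(len(V)):
        V[s] = backup(s, V)       # later backups see a mix of old and new values
    return V
```

Both variants do the same amount of arithmetic per sweep; the in-place version typically needs fewer sweeps because backups later in a sweep already use the freshly updated values of earlier states, which is presumably what "faster" refers to in the book.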
1
vote
1 answer
Why isn't the implementation of my policy evaluation for a simple MDP converging?
I am trying to code a policy evaluation algorithm to find $V^\pi(s)$ for all states. The diagram below shows the MDP.
In this case I let $p = q = 0.5$.
The rewards for each state are independent of the action, i.e. $r(\sigma_0)$ =…

calveeen
- 1,251
- 7
- 17
-1
votes
1 answer
Using states (features) and actions from a heuristic model to estimate the value function of a reinforcement learning agent
New to RL here.
As far as I understood from RL courses, there are two sides of reinforcement learning: policy evaluation, which is the task of finding the value function for a certain policy, and control, which is maximizing the reward or the…

Ramzy
- 3
- 5