Questions tagged [policy-improvement]

For questions related to policy improvement (e.g. the policy improvement step used in the policy iteration algorithm).

7 questions
4 votes, 1 answer

Understanding the update rule for the policy in the policy iteration algorithm

Consider the grid world problem in RL. Formally, a policy in RL is defined as $\pi(a|s)$. If we are solving grid world by policy iteration, then the following pseudocode is used: My question is related to the policy improvement step. Specifically, I…
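A minimal sketch of the improvement step being asked about, assuming a deterministic grid world whose dynamics are stored in a hypothetical table P[s][a] = (next_state, reward) and whose state values V were already computed by policy evaluation:

```python
import numpy as np

def policy_improvement(P, V, n_states, n_actions, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to V."""
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        # One-step lookahead: Q(s, a) = r + gamma * V(s') for each action.
        q = np.empty(n_actions)
        for a in range(n_actions):
            next_s, r = P[s][a]        # deterministic transition assumed
            q[a] = r + gamma * V[next_s]
        policy[s] = int(np.argmax(q))  # pi'(s) = argmax_a Q(s, a)
    return policy
```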
2 votes, 1 answer

Why do we need to go back to policy evaluation after policy improvement if the policy is not stable?

Above is the algorithm for Policy Iteration from Sutton's RL book. So, step 2 actually looks like value iteration, and then, at step 3 (policy improvement), if the policy isn't stable, it goes back to step 2. I don't really understand this: it seems…
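A minimal sketch of the outer loop, with hypothetical evaluate/improve helpers, showing why control returns to evaluation: once the improvement step changes the policy, the stored V still describes the old policy and must be recomputed.

```python
import numpy as np

def policy_iteration(evaluate, improve, policy):
    """Alternate evaluation and improvement until the policy is stable."""
    while True:
        V = evaluate(policy)              # step 2: policy evaluation
        new_policy = improve(V)           # step 3: policy improvement
        if np.array_equal(new_policy, policy):
            return policy, V              # policy stable: V and pi are optimal
        policy = new_policy               # V was computed for the old policy,
                                          # so it must be re-evaluated
```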
2 votes, 1 answer

Is value iteration stopped after one update of each state?

In Section 4.4, Value Iteration, the authors write: "One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration." After that, they provide the following…
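A minimal sketch of that special case, assuming the dynamics $p(s', r \mid s, a)$ are given as a hypothetical table P[s][a] holding (prob, next_state, reward) triples; note how the max over actions replaces the fixed-policy backup, so each sweep is one truncated evaluation fused with improvement:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Policy evaluation truncated to one sweep, fused with improvement."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):         # one sweep = one update per state
            v_old = V[s]
            # Instead of backing up under a fixed policy, take the max
            # over actions: evaluation and improvement in a single update.
            V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V
```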
1 vote, 1 answer

In value iteration, what happens if we try to obtain the greedy policy while looping through the states?

I am referring to the Value Iteration (VI) algorithm as mentioned in Sutton's book below. Rather than getting the greedy deterministic policy after VI converges, what happens if we try to obtain the greedy policy while looping through the states…
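A minimal sketch of the greedy-policy extraction, using the same hypothetical (prob, next_state, reward) layout as above. Calling it on an intermediate V inside the loop simply returns the greedy policy for that estimate; it costs extra computation but does not alter the V updates themselves.

```python
import numpy as np

def greedy_policy(P, V, n_states, n_actions, gamma=0.9):
    """Greedy policy with respect to the current value estimate V."""
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(n_actions)]
        policy[s] = int(np.argmax(q))
    return policy
```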
1 vote, 1 answer

How do we get from conditional expectation on both state and action to only state in the proof of the Policy Improvement Theorem?

I'm going through Sutton and Barto's book Reinforcement Learning: An Introduction and I'm trying to understand the proof of the Policy Improvement Theorem, presented on page 78 of the physical book. The theorem goes as follows: Let $\pi$ and $\pi'$…
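For reference, the opening lines of that proof in Sutton and Barto's notation; the step the question targets is the last equality, where conditioning on $A_t = \pi'(s)$ can be replaced by the subscript $\pi'$ because the deterministic policy $\pi'$ selects exactly that action in state $s$:

$$
\begin{aligned}
v_\pi(s) &\le q_\pi\big(s, \pi'(s)\big) \\
&= \mathbb{E}\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s,\, A_t = \pi'(s) \big] \\
&= \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \big].
\end{aligned}
$$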
1 vote, 1 answer

Monte Carlo epsilon-greedy Policy Iteration: monotonic improvement for all cases or for the expected value?

I was going through university slides, and this particular slide tries to prove that in a Monte Carlo Policy Iteration algorithm using an epsilon-greedy policy, the state values (V-values) are monotonically improving. My question is about the…
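For context, the standard inequality behind such slides (cf. Sutton and Barto, Section 5.4): for an $\varepsilon$-greedy successor $\pi'$ of an $\varepsilon$-soft policy $\pi$,

$$
\begin{aligned}
\sum_a \pi'(a \mid s)\, q_\pi(s,a)
&= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s,a) + (1-\varepsilon) \max_a q_\pi(s,a) \\
&\ge \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s,a) + (1-\varepsilon) \sum_a \frac{\pi(a \mid s) - \frac{\varepsilon}{|\mathcal{A}(s)|}}{1-\varepsilon}\, q_\pi(s,a) = v_\pi(s).
\end{aligned}
$$

The inequality holds because the max is at least any convex combination of the $q_\pi(s,a)$ (the weights are nonnegative and sum to one because $\pi$ is $\varepsilon$-soft); note that the whole statement is in terms of expected values.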
0 votes, 1 answer

How is the first line obtained, and where is the information $v_{\pi}(x_k)=v_{\pi'}(x_k)$ used in the following derivation regarding the greedy policy?

I did not understand how the equation below is obtained. I.e., how is the first line obtained, and where is the information $v_{\pi}(x_k)=v_{\pi'}(x_k)$ used in the derivation regarding the greedy policy? where $\pi^{\prime}(x_k)=\arg\max_{u_k \in…
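Since the excerpt is truncated, here is only a sketch of the typical first line in this control-theoretic notation, assuming deterministic dynamics $x_{k+1} = f(x_k, u_k)$ and one-step reward $r(x_k, u_k)$:

$$
v_\pi(x_k) \le \max_{u_k} \big[ r(x_k, u_k) + \gamma\, v_\pi\big(f(x_k, u_k)\big) \big]
          = r\big(x_k, \pi'(x_k)\big) + \gamma\, v_\pi\big(f(x_k, \pi'(x_k))\big),
$$

with $\pi'(x_k) = \arg\max_{u_k} \big[ r(x_k, u_k) + \gamma\, v_\pi(f(x_k, u_k)) \big]$.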