Questions tagged [policy-improvement]

For questions related to policy improvement (e.g. the policy improvement step used in the policy iteration algorithm).

7 questions
4 votes, 1 answer

Understanding the update rule for the policy in the policy iteration algorithm

Consider the grid world problem in RL. Formally, a policy in RL is defined as $\pi(a|s)$. If we are solving grid world by policy iteration, then the following pseudocode is used: My question is related to the policy improvement step. Specifically, I…
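A minimal sketch of the improvement step being asked about, assuming a deterministic grid world whose dynamics are stored in a hypothetical table P[s][a] = (next_state, reward) and whose state values V were already computed by policy evaluation:

```python
import numpy as np

def policy_improvement(P, V, n_states, n_actions, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to V."""
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        # One-step lookahead: Q(s, a) = r + gamma * V(s') for each action.
        q = np.empty(n_actions)
        for a in range(n_actions):
            next_s, r = P[s][a]        # deterministic transition assumed
            q[a] = r + gamma * V[next_s]
        policy[s] = int(np.argmax(q))  # pi'(s) = argmax_a Q(s, a)
    return policy
```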
2 votes, 1 answer

Why do we need to go back to policy evaluation after policy improvement if the policy is not stable?

Above is the algorithm for Policy Iteration from Sutton's RL book. So, step 2 actually looks like value iteration, and then, at step 3 (policy improvement), if the policy isn't stable, it goes back to step 2. I don't really understand this: it seems…
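A minimal sketch of the outer loop, with hypothetical evaluate/improve helpers, showing why control returns to evaluation: once the improvement step changes the policy, the stored V still describes the old policy and must be recomputed.

```python
import numpy as np

def policy_iteration(evaluate, improve, policy):
    """Alternate evaluation and improvement until the policy is stable."""
    while True:
        V = evaluate(policy)              # step 2: policy evaluation
        new_policy = improve(V)           # step 3: policy improvement
        if np.array_equal(new_policy, policy):
            return policy, V              # policy stable: V and pi are optimal
        policy = new_policy               # V was computed for the old policy,
                                          # so it must be re-evaluated
```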
2 votes, 1 answer

Is value iteration stopped after one update of each state?

In Section 4.4, Value Iteration, the authors write: "One important special case is when policy evaluation is stopped after just one sweep (one update of each state). This algorithm is called value iteration." After that, they provide the following…
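A minimal sketch of that special case, assuming the dynamics $p(s', r \mid s, a)$ are given as a hypothetical table P[s][a] holding (prob, next_state, reward) triples; note how the max over actions replaces the fixed-policy backup, so each sweep is one truncated evaluation fused with improvement:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Policy evaluation truncated to one sweep, fused with improvement."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):         # one sweep = one update per state
            v_old = V[s]
            # Instead of backing up under a fixed policy, take the max
            # over actions: evaluation and improvement in a single update.
            V[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions)
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V
```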
1 vote, 1 answer

In value iteration, what happens if we try to obtain the greedy policy while looping through the states?

I am referring to the Value Iteration (VI) algorithm as mentioned in Sutton's book below. Rather than getting the greedy deterministic policy after VI converges, what happens if we try to obtain the greedy policy while looping through the states…
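A minimal sketch of the greedy-policy extraction, using the same hypothetical (prob, next_state, reward) layout as above. Calling it on an intermediate V inside the loop simply returns the greedy policy for that estimate; it costs extra computation but does not alter the V updates themselves.

```python
import numpy as np

def greedy_policy(P, V, n_states, n_actions, gamma=0.9):
    """Greedy policy with respect to the current value estimate V."""
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in range(n_actions)]
        policy[s] = int(np.argmax(q))
    return policy
```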
1 vote, 1 answer

How do we get from conditional expectation on both state and action to only state in the proof of the Policy Improvement Theorem?

I'm going through Sutton and Barto's book Reinforcement Learning: An Introduction and I'm trying to understand the proof of the Policy Improvement Theorem, presented on page 78 of the physical book. The theorem goes as follows: Let $\pi$ and $\pi'$…
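For reference, the opening lines of that proof in Sutton and Barto's notation; the step the question targets is the last equality, where conditioning on $A_t = \pi'(s)$ can be replaced by the subscript $\pi'$ because the deterministic policy $\pi'$ selects exactly that action in state $s$:

$$
\begin{aligned}
v_\pi(s) &\le q_\pi\big(s, \pi'(s)\big) \\
&= \mathbb{E}\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s,\, A_t = \pi'(s) \big] \\
&= \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \big].
\end{aligned}
$$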
1 vote, 1 answer

Monte Carlo epsilon-greedy Policy Iteration: monotonic improvement for all cases or for the expected value?

I was going through university slides, and this particular slide tries to prove that in a Monte Carlo Policy Iteration algorithm using an epsilon-greedy policy, the state values (V-values) are monotonically improving. My question is about the…
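For context, the standard inequality behind such slides (cf. Sutton and Barto, Section 5.4): for an $\varepsilon$-greedy successor $\pi'$ of an $\varepsilon$-soft policy $\pi$,

$$
\begin{aligned}
\sum_a \pi'(a \mid s)\, q_\pi(s,a)
&= \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s,a) + (1-\varepsilon) \max_a q_\pi(s,a) \\
&\ge \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s,a) + (1-\varepsilon) \sum_a \frac{\pi(a \mid s) - \frac{\varepsilon}{|\mathcal{A}(s)|}}{1-\varepsilon}\, q_\pi(s,a) = v_\pi(s).
\end{aligned}
$$

The inequality holds because the max is at least any convex combination of the $q_\pi(s,a)$ (the weights are nonnegative and sum to one because $\pi$ is $\varepsilon$-soft); note that the whole statement is in terms of expected values.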
0 votes, 1 answer

How is the first line obtained, and where is the information $v_{\pi}(x_k)=v_{\pi'}(x_k)$ used in the following derivation regarding the greedy policy?

I did not understand how the equation below is obtained. I.e., how is the first line obtained, and where is the information $v_{\pi}(x_k)=v_{\pi'}(x_k)$ used in the derivation regarding the greedy policy? where $\pi^{\prime}(x_k)=\arg\max_{u_k \in…
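Since the excerpt is truncated, here is only a sketch of the typical first line in this control-theoretic notation, assuming deterministic dynamics $x_{k+1} = f(x_k, u_k)$ and one-step reward $r(x_k, u_k)$:

$$
v_\pi(x_k) \le \max_{u_k} \big[ r(x_k, u_k) + \gamma\, v_\pi\big(f(x_k, u_k)\big) \big]
          = r\big(x_k, \pi'(x_k)\big) + \gamma\, v_\pi\big(f(x_k, \pi'(x_k))\big),
$$

with $\pi'(x_k) = \arg\max_{u_k} \big[ r(x_k, u_k) + \gamma\, v_\pi(f(x_k, u_k)) \big]$.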