For questions related to policies (as defined in reinforcement learning or other AI sub-fields).
Questions tagged [policies]
76 questions
16
votes
3 answers
Is the optimal policy always stochastic if the environment is also stochastic?
Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?
Intuitively, if the environment is deterministic (that is, if the agent is in a state $s$ and…

nbro
- 39,006
- 12
- 98
- 176
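A minimal sketch of the deterministic-versus-stochastic distinction this question relies on, assuming a toy discrete state and action space (the state names, action names, and probabilities below are placeholders, not taken from the question):

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.7, "right": 0.3},
    "s1": {"left": 0.1, "right": 0.9},
}

def act_deterministic(state):
    """Return the single action prescribed for this state."""
    return deterministic_policy[state]

def act_stochastic(state):
    """Sample an action from this state's action distribution."""
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s0"))  # always "left"
print(act_stochastic("s0"))     # "left" with probability 0.7, "right" with 0.3
```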
15
votes
4 answers
What does "stationary" mean in the context of reinforcement learning?
I think I've seen the expressions "stationary data", "stationary dynamics" and "stationary policy", among others, in the context of reinforcement learning. What do these expressions mean? I think a stationary policy means that the policy does not depend on time,…

Paula Vega
- 428
- 4
- 8
8
votes
1 answer
What is the difference between a stationary and a non-stationary policy?
In reinforcement learning, there are deterministic and non-deterministic (or stochastic) policies, but there are also stationary and non-stationary policies.
What is the difference between a stationary and a non-stationary policy? How do you…

nbro
- 39,006
- 12
- 98
- 176
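One way to state the stationary/non-stationary distinction asked about here (a sketch in my own phrasing, not taken from either question): a stationary policy is a single map applied at every time step, while a non-stationary policy is allowed to change with the time step,
$$\pi : \mathcal{S} \to \Delta(\mathcal{A}) \quad \text{(stationary)}, \qquad \pi_t : \mathcal{S} \to \Delta(\mathcal{A}), \; t = 0, 1, 2, \dots \quad \text{(non-stationary)},$$
where $\Delta(\mathcal{A})$ denotes the set of probability distributions over the action space.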
8
votes
3 answers
What is the difference between a stochastic and a deterministic policy?
In reinforcement learning, there are the concepts of stochastic (or probabilistic) and deterministic policies. What is the difference between them?

nbro
- 39,006
- 12
- 98
- 176
6
votes
0 answers
Proof that there always exists a dominating policy in an MDP
I think that it is common knowledge that for any infinite horizon discounted MDP $(S, A, P, r, \gamma)$, there always exists a dominating policy $\pi$, i.e. a policy $\pi$ such that for all policies $\pi'$: $$V_\pi (s) \geq V_{\pi'}(s) \quad…

MMM
- 185
- 3
6
votes
1 answer
What is the relation between a policy which is the solution to an MDP and a policy like $\epsilon$-greedy?
In the context of reinforcement learning, a policy, $\pi$, is often defined as a function from the space of states, $\mathcal{S}$, to the space of actions, $\mathcal{A}$, that is, $\pi : \mathcal{S} \rightarrow \mathcal{A}$. This function is the…

nbro
- 39,006
- 12
- 98
- 176
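A minimal $\epsilon$-greedy sketch to make the contrast with a plain map $\pi : \mathcal{S} \rightarrow \mathcal{A}$ concrete, assuming a tabular Q-function (the Q-values and the value of EPSILON are placeholders, not from the question):

```python
import random

EPSILON = 0.1  # exploration rate (placeholder value)

# Hypothetical Q-table: Q[state][action] -> estimated action value.
Q = {"s0": {"left": 0.2, "right": 0.5}}

def epsilon_greedy(state):
    """With probability EPSILON take a uniformly random action,
    otherwise take the greedy (highest-Q) action."""
    actions = list(Q[state].keys())
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

print(epsilon_greedy("s0"))  # usually "right", occasionally "left"
```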
5
votes
2 answers
Why is the derivative of this objective function 0 if the policy is deterministic?
In the Berkeley RL class CS294-112 Fa18 9/5/18, they mention that the following gradient would be 0 if the policy is deterministic.
$$
\nabla_{\theta} J(\theta)=E_{\tau \sim \pi_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log…

jonperl
- 153
- 7
5
votes
2 answers
Given two optimal policies, is an affine combination of them also optimal?
If there are two different optimal policies $\pi_1, \pi_2$ in a reinforcement learning task, will the linear combination (or affine combination) of the two policies $\alpha \pi_1 + \beta \pi_2, \alpha + \beta = 1$ also be an optimal policy?
Here I…

yang liu
- 53
- 3
5
votes
1 answer
How do I compute the variance of the return of an evaluation policy using two behaviour policies?
Suppose there is an evaluation policy called $\pi_{e}$ and there are two behavior policies $\pi_{b1}$ and $\pi_{b2}$. I know that it is possible to estimate the return of policy $\pi_{e}$ through behavior policies via importance sampling, which is…

Amin
- 471
- 2
- 11
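For reference, a sketch of the ordinary importance-sampling estimator the question builds on, written for a single behaviour policy and undiscounted returns (the function names and the $\gamma = 1$ simplification are mine, not from the question):

```python
def importance_sampling_estimate(trajectories, pi_e, pi_b):
    """Ordinary importance sampling: average rho * G over trajectories.

    Each trajectory is a list of (state, action, reward) tuples collected
    under the behaviour policy; pi_e(action, state) and pi_b(action, state)
    return the probability of taking `action` in `state`.
    """
    estimates = []
    for trajectory in trajectories:
        rho = 1.0  # product of per-step probability ratios
        ret = 0.0  # return of the trajectory (gamma = 1 for brevity)
        for state, action, reward in trajectory:
            rho *= pi_e(action, state) / pi_b(action, state)
            ret += reward
        estimates.append(rho * ret)
    return sum(estimates) / len(estimates)
```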
4
votes
1 answer
An example of a unique value function which is associated with multiple optimal policies
In the 4th paragraph of http://www.incompleteideas.net/book/ebook/node37.html it is mentioned: "Whereas the optimal value functions for states and state-action pairs are unique for a given MDP, there can be many optimal policies." Could you please…

Melanie A
- 143
- 2
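A quick illustration of the quoted claim (a toy argument of my own, not from the linked page): if two actions $a_1, a_2$ both attain $\max_a q_*(s, a)$ in some state $s$, then any policy that splits its probability mass between $a_1$ and $a_2$ in $s$, in any proportion, is optimal, so infinitely many optimal policies share the single optimal value function $v_*$.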
4
votes
1 answer
Why do we have two similar action selection strategies for UCB1?
In the literature, there are at least two action selection strategies associated with UCB1. For example, in the paper Algorithms for the multi-armed bandit problem (2000/2014), at time step $t$, an action is…

nbro
- 39,006
- 12
- 98
- 176
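For orientation, one commonly quoted form of the UCB1 rule (as in Auer et al.'s analysis; the question concerns a variant from a different paper, so treat this only as the standard baseline):
$$a_t = \arg\max_{a} \left( \bar{x}_a + \sqrt{\frac{2 \ln t}{n_a}} \right),$$
where $\bar{x}_a$ is the empirical mean reward of arm $a$ and $n_a$ is the number of times arm $a$ has been played so far.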
4
votes
1 answer
Why doesn't value iteration use $\pi(a \mid s)$ while policy evaluation does?
I was looking at the Bellman equation, and I noticed a difference between the equations used in policy evaluation and value iteration.
In policy evaluation, the update contains the term $\pi(a \mid s)$, which indicates the probability of choosing…

Chukwudi Ogbonna
- 125
- 4
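The two updates the question contrasts, in the usual Sutton-and-Barto notation with dynamics $p(s', r \mid s, a)$ and discount $\gamma$:
$$V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma V(s')\big] \quad \text{(policy evaluation)},$$
$$V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma V(s')\big] \quad \text{(value iteration)},$$
so value iteration replaces the expectation over $\pi(a \mid s)$ with a maximisation over actions, which is why no explicit policy appears in it.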
4
votes
1 answer
Why does having a fixed policy change a Markov Decision Process to a Markov Reward Process?
If a policy is fixed, it is said that a Markov Decision Process (MDP) becomes a Markov Reward Process (MRP).
Why is this so? Aren't the transitions and rewards still parameterized by the action and current state? In other words, aren't the…

Peter
- 43
- 3
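One way to make the claimed reduction concrete (the standard construction, stated here as a sketch): fixing a policy $\pi$ averages the action out of the transition and reward functions,
$$P^{\pi}(s' \mid s) = \sum_a \pi(a \mid s)\, P(s' \mid s, a), \qquad r^{\pi}(s) = \sum_a \pi(a \mid s)\, r(s, a),$$
which are exactly the transition kernel and reward function of a Markov reward process over states alone.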
4
votes
2 answers
Why is having low variance important in offline policy evaluation in reinforcement learning?
Intuitively, I understand that having an unbiased estimate of a policy is important, because being biased just means that our estimate is far from the true value.
However, I don't clearly understand why having lower variance is important. Is…

Hunnam
- 227
- 1
- 6
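A worked identity that usually motivates the answer (the standard bias-variance decomposition, not specific to this question): for an estimator $\hat{V}$ of a quantity $V$,
$$\mathbb{E}\big[(\hat{V} - V)^2\big] = \mathrm{Bias}(\hat{V})^2 + \mathrm{Var}(\hat{V}),$$
so even an unbiased off-policy estimate can have a large expected squared error if its variance is large.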
4
votes
1 answer
Can someone please help me validate my MDP?
Problem Statement:
I have a system with four states, S1 through S4, where S1 is the beginning state and S4 is the end/terminal state. The next state is always better than the previous state, i.e., if the agent is at S2, it is in a slightly more…

Bhavana
- 83
- 6