For questions related to policies (as defined in reinforcement learning or other AI sub-fields).
Questions tagged [policies]
76 questions
16
votes
3 answers
Is the optimal policy always stochastic if the environment is also stochastic?
Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic?
Intuitively, if the environment is deterministic (that is, if the agent is in a state $s$ and…

nbro
- 39,006
- 12
- 98
- 176
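A minimal sketch of the deterministic-versus-stochastic distinction this question relies on, assuming a toy discrete state and action space (the state names, action names, and probabilities below are placeholders, not taken from the question):

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s0": {"left": 0.7, "right": 0.3},
    "s1": {"left": 0.1, "right": 0.9},
}

def act_deterministic(state):
    """Return the single action prescribed for this state."""
    return deterministic_policy[state]

def act_stochastic(state):
    """Sample an action from this state's action distribution."""
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s0"))  # always "left"
print(act_stochastic("s0"))     # "left" with probability 0.7, "right" with 0.3
```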
15
votes
4 answers
What does "stationary" mean in the context of reinforcement learning?
I think I've seen the expressions "stationary data", "stationary dynamics" and "stationary policy", among others, in the context of reinforcement learning. What do these expressions mean? I think a stationary policy means that the policy does not depend on time,…

Paula Vega
- 428
- 4
- 8
8
votes
1 answer
What is the difference between a stationary and a non-stationary policy?
In reinforcement learning, there are deterministic and non-deterministic (or stochastic) policies, but there are also stationary and non-stationary policies.
What is the difference between a stationary and a non-stationary policy? How do you…

nbro
- 39,006
- 12
- 98
- 176
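One way to state the stationary/non-stationary distinction asked about here (a sketch in my own phrasing, not taken from either question): a stationary policy is a single map applied at every time step, while a non-stationary policy is allowed to change with the time step,
$$\pi : \mathcal{S} \to \Delta(\mathcal{A}) \quad \text{(stationary)}, \qquad \pi_t : \mathcal{S} \to \Delta(\mathcal{A}), \; t = 0, 1, 2, \dots \quad \text{(non-stationary)},$$
where $\Delta(\mathcal{A})$ denotes the set of probability distributions over the action space.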
8
votes
3 answers
What is the difference between a stochastic and a deterministic policy?
In reinforcement learning, there are the concepts of stochastic (or probabilistic) and deterministic policies. What is the difference between them?

nbro
- 39,006
- 12
- 98
- 176
6
votes
0 answers
Proof that there always exists a dominating policy in an MDP
I think that it is common knowledge that for any infinite horizon discounted MDP $(S, A, P, r, \gamma)$, there always exists a dominating policy $\pi$, i.e. a policy $\pi$ such that for all policies $\pi'$: $$V_\pi (s) \geq V_{\pi'}(s) \quad…

MMM
- 185
- 3
6
votes
1 answer
What is the relation between a policy which is the solution to an MDP and a policy like $\epsilon$-greedy?
In the context of reinforcement learning, a policy, $\pi$, is often defined as a function from the space of states, $\mathcal{S}$, to the space of actions, $\mathcal{A}$, that is, $\pi : \mathcal{S} \rightarrow \mathcal{A}$. This function is the…

nbro
- 39,006
- 12
- 98
- 176
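A minimal $\epsilon$-greedy sketch to make the contrast with a plain map $\pi : \mathcal{S} \rightarrow \mathcal{A}$ concrete, assuming a tabular Q-function (the Q-values and the value of EPSILON are placeholders, not from the question):

```python
import random

EPSILON = 0.1  # exploration rate (placeholder value)

# Hypothetical Q-table: Q[state][action] -> estimated action value.
Q = {"s0": {"left": 0.2, "right": 0.5}}

def epsilon_greedy(state):
    """With probability EPSILON take a uniformly random action,
    otherwise take the greedy (highest-Q) action."""
    actions = list(Q[state].keys())
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

print(epsilon_greedy("s0"))  # usually "right", occasionally "left"
```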
5
votes
2 answers
Why is the derivative of this objective function 0 if the policy is deterministic?
In the Berkeley RL class CS294-112 Fa18 9/5/18, they mention that the following gradient would be 0 if the policy is deterministic.
$$
\nabla_{\theta} J(\theta)=E_{\tau \sim \pi_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log…

jonperl
- 153
- 7
5
votes
2 answers
Given two optimal policies, is an affine combination of them also optimal?
If there are two different optimal policies $\pi_1, \pi_2$ in a reinforcement learning task, will the linear combination (or affine combination) of the two policies $\alpha \pi_1 + \beta \pi_2, \alpha + \beta = 1$ also be an optimal policy?
Here I…

yang liu
- 53
- 3
5
votes
1 answer
How do I compute the variance of the return of an evaluation policy using two behaviour policies?
Suppose there is an evaluation policy called $\pi_{e}$ and there are two behavior policies $\pi_{b1}$ and $\pi_{b2}$. I know that it is possible to estimate the return of policy $\pi_{e}$ through behavior policies via importance sampling, which is…

Amin
- 471
- 2
- 11
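For reference, a sketch of the ordinary importance-sampling estimator the question builds on, written for a single behaviour policy and undiscounted returns (the function names and the $\gamma = 1$ simplification are mine, not from the question):

```python
def importance_sampling_estimate(trajectories, pi_e, pi_b):
    """Ordinary importance sampling: average rho * G over trajectories.

    Each trajectory is a list of (state, action, reward) tuples collected
    under the behaviour policy; pi_e(action, state) and pi_b(action, state)
    return the probability of taking `action` in `state`.
    """
    estimates = []
    for trajectory in trajectories:
        rho = 1.0  # product of per-step probability ratios
        ret = 0.0  # return of the trajectory (gamma = 1 for brevity)
        for state, action, reward in trajectory:
            rho *= pi_e(action, state) / pi_b(action, state)
            ret += reward
        estimates.append(rho * ret)
    return sum(estimates) / len(estimates)
```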
4
votes
1 answer
An example of a unique value function which is associated with multiple optimal policies
In the 4th paragraph of http://www.incompleteideas.net/book/ebook/node37.html it is mentioned: "Whereas the optimal value functions for states and state-action pairs are unique for a given MDP, there can be many optimal policies." Could you please…

Melanie A
- 143
- 2
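A quick illustration of the quoted claim (a toy argument of my own, not from the linked page): if two actions $a_1, a_2$ both attain $\max_a q_*(s, a)$ in some state $s$, then any policy that splits its probability mass between $a_1$ and $a_2$ in $s$, in any proportion, is optimal, so infinitely many optimal policies share the single optimal value function $v_*$.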
4
votes
1 answer
Why do we have two similar action selection strategies for UCB1?
In the literature, there are at least two action selection strategies associated with UCB1. For example, in the paper Algorithms for the multi-armed bandit problem (2000/2014), at time step $t$, an action is…

nbro
- 39,006
- 12
- 98
- 176
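For orientation, one commonly quoted form of the UCB1 rule (as in Auer et al.'s analysis; the question concerns a variant from a different paper, so treat this only as the standard baseline):
$$a_t = \arg\max_{a} \left( \bar{x}_a + \sqrt{\frac{2 \ln t}{n_a}} \right),$$
where $\bar{x}_a$ is the empirical mean reward of arm $a$ and $n_a$ is the number of times arm $a$ has been played so far.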
4
votes
1 answer
Why doesn't value iteration use $\pi(a \mid s)$ while policy evaluation does?
I was looking at the Bellman equation, and I noticed a difference between the equations used in policy evaluation and value iteration.
In policy evaluation, the update contains the term $\pi(a \mid s)$, which indicates the probability of choosing…

Chukwudi Ogbonna
- 125
- 4
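The two updates the question contrasts, in the usual Sutton-and-Barto notation with dynamics $p(s', r \mid s, a)$ and discount $\gamma$:
$$V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma V(s')\big] \quad \text{(policy evaluation)},$$
$$V(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma V(s')\big] \quad \text{(value iteration)},$$
so value iteration replaces the expectation over $\pi(a \mid s)$ with a maximisation over actions, which is why no explicit policy appears in it.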
4
votes
1 answer
Why does having a fixed policy change a Markov Decision Process to a Markov Reward Process?
If a policy is fixed, it is said that a Markov Decision Process (MDP) becomes a Markov Reward Process (MRP).
Why is this so? Aren't the transitions and rewards still parameterized by the action and current state? In other words, aren't the…

Peter
- 43
- 3
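One way to make the claimed reduction concrete (the standard construction, stated here as a sketch): fixing a policy $\pi$ averages the action out of the transition and reward functions,
$$P^{\pi}(s' \mid s) = \sum_a \pi(a \mid s)\, P(s' \mid s, a), \qquad r^{\pi}(s) = \sum_a \pi(a \mid s)\, r(s, a),$$
which are exactly the transition kernel and reward function of a Markov reward process over states alone.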
4
votes
2 answers
Why is having low variance important in offline policy evaluation in reinforcement learning?
Intuitively, I understand that having an unbiased estimate of a policy is important, because being biased just means that our estimate is far from the true value.
However, I don't clearly understand why having lower variance is important. Is…

Hunnam
- 227
- 1
- 6
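A worked identity that usually motivates the answer (the standard bias-variance decomposition, not specific to this question): for an estimator $\hat{V}$ of a quantity $V$,
$$\mathbb{E}\big[(\hat{V} - V)^2\big] = \mathrm{Bias}(\hat{V})^2 + \mathrm{Var}(\hat{V}),$$
so even an unbiased off-policy estimate can have a large expected squared error if its variance is large.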
4
votes
1 answer
Can someone please help me validate my MDP?
Problem Statement:
I have a system with four states, S1 through S4, where S1 is the beginning state and S4 is the end/terminal state. The next state is always better than the previous state, i.e., if the agent is at S2, it is in a slightly more…

Bhavana
- 83
- 6