Questions tagged [policies]

For questions related to policies (as defined in reinforcement learning or other AI sub-fields).

76 questions
16 votes · 3 answers

Is the optimal policy always stochastic if the environment is also stochastic?

Is the optimal policy always stochastic (that is, a map from states to a probability distribution over actions) if the environment is also stochastic? Intuitively, if the environment is deterministic (that is, if the agent is in a state $s$ and…
15 votes · 4 answers

What does "stationary" mean in the context of reinforcement learning?

I think I've seen the expressions "stationary data", "stationary dynamics" and "stationary policy", among others, in the context of reinforcement learning. What do they mean? I think a stationary policy means that the policy does not depend on time,…
8 votes · 1 answer

What is the difference between a stationary and a non-stationary policy?

In reinforcement learning, there are deterministic and non-deterministic (or stochastic) policies, but there are also stationary and non-stationary policies. What is the difference between a stationary and a non-stationary policy? How do you…
8 votes · 3 answers

What is the difference between a stochastic and a deterministic policy?

In reinforcement learning, there are the concepts of stochastic (or probabilistic) and deterministic policies. What is the difference between them?
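For concreteness, a minimal sketch of the distinction in code, assuming a small tabular setting with made-up arrays q_table and action_probs (both hypothetical, not part of the question):

    import numpy as np

    rng = np.random.default_rng(0)

    n_states, n_actions = 4, 3
    q_table = rng.normal(size=(n_states, n_actions))            # hypothetical Q(s, a)
    action_probs = rng.dirichlet(np.ones(n_actions), n_states)  # hypothetical pi(a | s)

    def deterministic_policy(state):
        # A deterministic policy maps each state to a single action,
        # e.g. the greedy action with respect to Q(s, a).
        return int(np.argmax(q_table[state]))

    def stochastic_policy(state):
        # A stochastic policy maps each state to a probability distribution
        # over actions and samples an action from it.
        return int(rng.choice(n_actions, p=action_probs[state]))

Calling deterministic_policy(0) repeatedly always returns the same action, whereas stochastic_policy(0) may return different actions across calls.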
6 votes · 0 answers

Proof that there always exists a dominating policy in an MDP

I think it is common knowledge that for any infinite-horizon discounted MDP $(S, A, P, r, \gamma)$, there always exists a dominating policy $\pi$, i.e. a policy $\pi$ such that, for all policies $\pi'$: $$V_\pi (s) \geq V_{\pi'}(s) \quad…
6 votes · 1 answer

What is the relation between a policy which is the solution to an MDP and a policy like $\epsilon$-greedy?

In the context of reinforcement learning, a policy, $\pi$, is often defined as a function from the space of states, $\mathcal{S}$, to the space of actions, $\mathcal{A}$, that is, $\pi : \mathcal{S} \rightarrow \mathcal{A}$. This function is the…
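As a side note, the $\epsilon$-greedy policy mentioned in the title is usually derived from an action-value function; a minimal sketch, assuming a tabular Q array and a hypothetical epsilon parameter:

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(Q, state, epsilon=0.1):
        # With probability epsilon, explore with a uniformly random action;
        # otherwise exploit by taking the greedy action argmax_a Q[state, a].
        n_actions = Q.shape[1]
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[state]))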
5 votes · 2 answers

Why is the derivative of this objective function 0 if the policy is deterministic?

In the Berkeley RL class CS294-112 Fa18 9/5/18, they mention the following gradient would be 0 if the policy is deterministic. $$ \nabla_{\theta} J(\theta)=E_{\tau \sim \pi_{\theta}(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log…
5 votes · 2 answers

Given two optimal policies, is an affine combination of them also optimal?

If there are two different optimal policies $\pi_1, \pi_2$ in a reinforcement learning task, will the linear combination (or affine combination) of the two policies $\alpha \pi_1 + \beta \pi_2, \alpha + \beta = 1$ also be an optimal policy? Here I…
5 votes · 1 answer

How do I compute the variance of the return of an evaluation policy using two behaviour policies?

Suppose there is an evaluation policy called $\pi_{e}$ and there are two behavior policies $\pi_{b1}$ and $\pi_{b2}$. I know that it is possible to estimate the return of policy $\pi_{e}$ through behavior policies via importance sampling, which is…
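A rough sketch of ordinary (per-trajectory) importance sampling with a single behaviour policy, assuming hypothetical callables pi_e(a, s) and pi_b(a, s) that return action probabilities; how to combine two behaviour policies, which is what the question asks about, is not addressed here:

    import numpy as np

    def is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
        # Ordinary importance sampling: weight each observed discounted return
        # by the product of probability ratios pi_e(a | s) / pi_b(a | s).
        estimates = []
        for traj in trajectories:                 # traj: list of (state, action, reward)
            weight, ret = 1.0, 0.0
            for t, (s, a, r) in enumerate(traj):
                weight *= pi_e(a, s) / pi_b(a, s)
                ret += (gamma ** t) * r
            estimates.append(weight * ret)
        return float(np.mean(estimates))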
4 votes · 1 answer

An example of a unique value function which is associated with multiple optimal policies

In the 4th paragraph of http://www.incompleteideas.net/book/ebook/node37.html, it is mentioned: "Whereas the optimal value functions for states and state-action pairs are unique for a given MDP, there can be many optimal policies." Could you please…
4 votes · 1 answer

Why do we have two similar action selection strategies for UCB1?

In the literature, there are at least two action selection strategies associated with UCB1. For example, in the paper Algorithms for the multi-armed bandit problem (2000/2014), at time step $t$, an action is…
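For context, one common form of the UCB1 rule (with the exploration constant 2 used by Auer et al.) can be sketched as follows; the arrays means and counts are assumed to hold per-arm empirical means and play counts, and this shows only one of the variants the question is about:

    import numpy as np

    def ucb1_action(means, counts, t):
        # Play each arm once, then pick the arm maximising its empirical
        # mean plus the exploration bonus sqrt(2 * ln(t) / n_a).
        means = np.asarray(means, dtype=float)
        counts = np.asarray(counts, dtype=float)
        untried = np.where(counts == 0)[0]
        if untried.size > 0:
            return int(untried[0])
        return int(np.argmax(means + np.sqrt(2.0 * np.log(t) / counts)))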
4 votes · 1 answer

Why doesn't value iteration use $\pi(a \mid s)$ while policy evaluation does?

I was looking at the Bellman equation, and I noticed a difference between the equations used in policy evaluation and value iteration. In policy evaluation, there is the term $\pi(a \mid s)$, which indicates the probability of choosing…
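For reference, the two updates being compared are, in the notation of Sutton and Barto,

$$V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, \big[ r + \gamma V(s') \big] \quad \text{(policy evaluation)}$$

$$V(s) \leftarrow \max_{a} \sum_{s', r} p(s', r \mid s, a)\, \big[ r + \gamma V(s') \big] \quad \text{(value iteration)}$$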
4 votes · 1 answer

Why does having a fixed policy change a Markov Decision Process to a Markov Reward Process?

If a policy is fixed, it is said that a Markov Decision Process (MDP) becomes a Markov Reward Process (MRP). Why is this so? Aren't the transitions and rewards still parameterized by the action and current state? In other words, aren't the…
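For concreteness, fixing the policy $\pi$ marginalises the action out of the transition and reward functions, leaving a process over states alone:

$$P^{\pi}(s' \mid s) = \sum_{a} \pi(a \mid s)\, P(s' \mid s, a), \qquad r^{\pi}(s) = \sum_{a} \pi(a \mid s)\, r(s, a).$$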
4 votes · 2 answers

Why is having low variance important in offline policy evaluation of reinforcement learning?

Intuitively, I understand that having an unbiased estimate of a policy's value is important, because being biased means that our estimate is off from the true value. However, I don't clearly understand why having lower variance is important. Is…
4 votes · 1 answer

Can someone please help me validate my MDP?

Problem statement: I have a system with four states, S1 through S4, where S1 is the starting state and S4 is the end/terminal state. The next state is always better than the previous state, i.e., if the agent is at S2, it is in a slightly more…