Questions tagged [off-policy-methods]

For questions related to off-policy reinforcement learning algorithms, which learn the value of one policy (the target policy) while collecting experience with a different policy (the behavior policy). Separating the two lets the behavior policy keep exploring all states sufficiently while the target policy is evaluated or improved. A canonical example of an off-policy algorithm is Q-learning.
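
As a concrete illustration of that split, here is a minimal tabular Q-learning sketch (a sketch only, assuming a Gymnasium-style discrete environment; all names are illustrative): the behavior policy is epsilon-greedy, while the update bootstraps from the greedy max over next-state actions, i.e. the target policy.

    import numpy as np

    def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
        """Run one episode of tabular Q-learning; Q is an (n_states, n_actions) array."""
        rng = rng or np.random.default_rng()
        state, _ = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy w.r.t. the current Q (keeps exploring).
            if rng.random() < epsilon:
                action = int(rng.integers(Q.shape[1]))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # Target policy: greedy -- bootstrap from max_a Q(s', a), not from the
            # action the behavior policy will actually take next.
            target = reward + (0.0 if terminated else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
        return Q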

70 questions
14 votes, 1 answer

What is the relation between online (or offline) learning and on-policy (or off-policy) algorithms?

In the context of RL, there is the notion of on-policy and off-policy algorithms. I understand the difference between on-policy and off-policy algorithms. Moreover, in RL, there's also the notion of online and offline learning. What is the relation…
11 votes, 1 answer

Do off-policy policy gradient methods exist?

Do off-policy policy gradient methods exist? I know that policy gradient methods themselves use the policy function for sampling rollouts. But can't we just as easily have a model for sampling from the environment? If so, I've never seen this done before.
echo • 673 • 1 • 5 • 12
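
They do exist. For reference, one commonly cited approximate form, in the spirit of Degris et al.'s Off-Policy Actor-Critic and stated here from memory, reweights the score-function gradient by an importance ratio and samples states from the behavior policy's distribution:

$$\nabla_\theta J_b(\theta) \;\approx\; \mathbb{E}_{s \sim d^b,\, a \sim b}\!\left[ \frac{\pi_\theta(a \mid s)}{b(a \mid s)}\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right],$$

where $b$ is the behavior policy and $d^b$ its state distribution (this approximation drops a term involving the gradient of $Q^{\pi_\theta}$ itself).
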
9 votes, 1 answer

Why is the n-step tree backup algorithm an off-policy algorithm?

In the reinforcement learning book by Sutton & Barto (2018 edition), specifically in Section 7.5, they present an n-step off-policy algorithm that doesn't require importance sampling, called the n-step tree backup algorithm. In other…
Brale • 2,306 • 1 • 5 • 14
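
For reference, the n-step tree-backup return from Section 7.5 (reproduced from memory, so worth checking against the book) is defined recursively as

$$G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n},$$

so the actions not taken contribute only through their expected value under the target policy $\pi$, which is why no importance-sampling ratio is needed even though the data come from a different behavior policy.
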
6 votes, 2 answers

How can the importance sampling ratio be different from zero when the target policy is deterministic?

In the book Reinforcement Learning: An Introduction (2nd edition), Sutton and Barto define, on page 104 (p. 126 of the pdf), in equation (5.3), the importance sampling ratio $\rho_{t:T-1}$ as follows: $$\rho…
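
For reference, that ratio is the product of the per-step action-probability ratios along the remainder of the episode,

$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},$$

so with a deterministic target policy each factor $\pi(A_k \mid S_k)$ is either $0$ or $1$, and the ratio is nonzero exactly when the behavior policy happened to take the same actions the target policy would have taken.
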
6 votes, 2 answers

What is the difference between on and off-policy deterministic actor-critic?

In the paper Deterministic Policy Gradient Algorithms, I am really confused about Sections 4.1 and 4.2, "On- and Off-Policy Deterministic Actor-Critic". I don't know what the difference between the two algorithms is. I only noticed that the…
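
For reference (from memory of the paper, so treat as a sketch): the actor update has the same form in both variants,

$$\nabla_\theta J(\mu_\theta) \approx \mathbb{E}_{s \sim \rho}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q(s, a)\big|_{a = \mu_\theta(s)} \right],$$

and the practical difference is where the states come from and how the critic is trained: the on-policy variant (4.1) uses the deterministic policy's own trajectories with a Sarsa-style critic, while the off-policy variant (4.2) samples states from a behavior policy's distribution $\rho^{\beta}$ and trains the critic with a Q-learning-style target. No importance ratio over actions appears, because the policy is deterministic.
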
6 votes, 1 answer

Is Expected SARSA an off-policy or on-policy algorithm?

I understand that SARSA is an on-policy algorithm and Q-learning an off-policy one. Sutton and Barto's textbook describes Expected Sarsa as follows: In these cliff walking results Expected Sarsa was used on-policy, but in general it might use a…
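
For reference, the Expected Sarsa update replaces the sampled next action with an expectation under a policy $\pi$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \right].$$

If $\pi$ is the same policy that generates the behavior, this is on-policy; if $\pi$ differs from the behavior policy, it is off-policy (with a greedy $\pi$ it reduces to Q-learning).
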
5 votes, 1 answer

Why do we need importance sampling?

I was studying the off-policy policy improvement method. Then I encountered importance sampling. I completely understood the mathematics behind the calculation, but I am wondering what a practical example of importance sampling looks like. For instance,…
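
As one toy, practical example (a sketch with made-up numbers, not from the book): suppose returns are collected under a behavior policy b, but we want the value of a different target policy pi. Simply averaging the collected returns estimates the value under b; reweighting each return by pi(a)/b(a) gives an (ordinary) importance-sampling estimate of the value under pi.

    import numpy as np

    rng = np.random.default_rng(0)
    reward_means = np.array([1.0, 2.0])   # expected reward of each of two actions (toy numbers)
    b = np.array([0.7, 0.3])              # behavior policy b(a), used to collect data
    pi = np.array([0.1, 0.9])             # target policy pi(a), the one we want to evaluate

    actions = rng.choice(2, size=100_000, p=b)        # actions sampled under b only
    returns = rng.normal(reward_means[actions], 1.0)  # noisy one-step returns

    naive = returns.mean()                    # estimates the value under b (about 1.3)
    ratios = pi[actions] / b[actions]         # importance-sampling ratios pi/b
    is_estimate = (ratios * returns).mean()   # estimates the value under pi (about 1.9)
    print(naive, is_estimate, pi @ reward_means)
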
5 votes, 1 answer

Why does off-policy learning outperform on-policy learning?

I am self-studying Reinforcement Learning using different online resources. I now have a basic understanding of how RL works. I saw this in a book: Q-learning is an off-policy learner. An off-policy learner learns the value of an optimal…
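
For reference, the two one-step targets make the distinction concrete: Sarsa bootstraps from the action actually taken next (so it learns the value of the policy being followed), while Q-learning bootstraps from the greedy action (so it learns the value of the greedy target policy, whatever policy generated the data):

$$\text{Sarsa target: } R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}), \qquad \text{Q-learning target: } R_{t+1} + \gamma \max_a Q(S_{t+1}, a).$$
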
5 votes, 1 answer

Understanding the n-step off-policy SARSA update

In Sutton & Barto's book (2nd ed.), page 149, there is equation 7.11. I am having a hard time understanding this equation. I would have thought that we should be moving $Q$ towards $G$, where $G$ would be corrected by importance sampling, but only…
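
Stated from memory, and therefore worth checking against the book, equation 7.11 has roughly this shape:

$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha\, \rho_{t+1:t+n}\, \big[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \big],$$

where the importance-sampling ratio starts at $t+1$ rather than $t$: the action $A_t$ is the one whose value is being updated, so it needs no correction, which is part of what the question is getting at.
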
5 votes, 1 answer

How do I compute the variance of the return of an evaluation policy using two behaviour policies?

Suppose there is an evaluation policy called $\pi_{e}$ and there are two behavior policies $\pi_{b1}$ and $\pi_{b2}$. I know that it is possible to estimate the return of policy $\pi_{e}$ through behavior policies via importance sampling, which is…
Amin • 471 • 2 • 11
4 votes, 1 answer

What is the difference between on-policy and off-policy for continuous environments?

I'm trying to understand RL applied to time series (so with infinite horizon) which have a continuous state space and a discrete action space. First, some preliminary questions: in this case, what is the optimal policy? Given the infinite horizon…
3 votes, 1 answer

Why are Q values updated according to the greedy policy?

Apparently, in the Q-learning algorithm, the Q values are not updated according to the "current policy", but according to a "greedy policy". Why is that the case? I think this is related to the fact that Q-learning is off-policy, but I am also not…
3 votes, 0 answers

How does off-policy Monte Carlo explore and converge?

Premises of the question: behavior policy: ε-greedy (stochastic); target policy: greedy (deterministic); importance sampling included. In off-policy Monte Carlo control, the behavior policy chooses actions to follow, and the target policy learns from…
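
Under exactly those premises, here is a minimal sketch of the per-episode backup in off-policy Monte Carlo control with weighted importance sampling (Q and C are tabular arrays, behavior_prob is a hypothetical function returning b(a|s); a sketch, not the book's pseudocode verbatim):

    import numpy as np

    def weighted_is_backup(episode, Q, C, gamma, behavior_prob):
        """episode: list of (state, action, reward); Q, C: (n_states, n_actions) arrays."""
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[s, a] += W
            Q[s, a] += (W / C[s, a]) * (G - Q[s, a])
            if a != int(np.argmax(Q[s])):   # greedy target: pi(a|s) = 0 for this action,
                break                       # so earlier pairs in the episode get zero weight
            W *= 1.0 / behavior_prob(s, a)  # pi(a|s) = 1 for the greedy action

Because the target policy is deterministic, earlier state-action pairs are only updated when the tail of the episode happens to follow the greedy actions, which is exactly why exploration and convergence are the delicate part of this setup.
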
3 votes, 1 answer

Why can we take the action $a$ from the next state $s'$ in the max part of the Q-learning update rule, if that action doesn't lead to any reward?

I'm using OpenAI's cartpole environment. First of all, is this environment not Markov? Knowing that, my main question concerns Q-learning and off-policy methods: For me, there is something weird in updating a Q value based on the max Q for a state…
3 votes, 1 answer

When learning off-policy with multi-step returns, why do we use the current behaviour policy in importance sampling?

When learning off-policy with multi-step returns, we want to update the value of $Q(s_1, a_1)$ using rewards from the trajectory $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_n, a_n, r_n, s_{n+1})$. We want to learn the target policy $\pi$ while…
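
For reference, the generic correction for such an n-step return is a product of per-action probability ratios over the actions after the first (which needs no correction, since $Q(s_1, a_1)$ already conditions on it):

$$\rho = \prod_{k=2}^{n} \frac{\pi(a_k \mid s_k)}{\mu_k(a_k \mid s_k)},$$

where $\mu_k$ is whichever policy's probability goes in the denominator; the question is precisely whether $\mu_k$ should be the behaviour policy that actually generated $a_k$ at collection time or the current behaviour policy.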