For questions about off-policy reinforcement learning algorithms, which estimate one policy (the target policy) while using another policy (the behavior policy) to generate experience during learning; this separation helps ensure that all states are sufficiently explored. An example of an off-policy algorithm is Q-learning.
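As a quick illustration of the tag description, here is a minimal tabular sketch of why Q-learning counts as off-policy: actions are selected with an ε-greedy behavior policy, but the bootstrap target evaluates the greedy target policy. The function names, the layout of the Q array, and the hyperparameters are assumptions of this sketch, not taken from any question below.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update. Off-policy: the target uses the
    greedy (target) policy via the max over next actions, regardless of
    which action the behavior policy will actually take next."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(Q, s, eps=0.1):
    """Behavior policy: explores with probability eps, otherwise greedy."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))   # random exploratory action
    return int(Q[s].argmax())                  # greedy action
```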
Questions tagged [off-policy-methods]
70 questions
14 votes, 1 answer
What is the relation between online (or offline) learning and on-policy (or off-policy) algorithms?
In the context of RL, there is the notion of on-policy and off-policy algorithms. I understand the difference between on-policy and off-policy algorithms. Moreover, in RL, there's also the notion of online and offline learning.
What is the relation…

nbro
- 39,006
- 12
- 98
- 176
11 votes, 1 answer
Do off-policy policy gradient methods exist?
Do off-policy policy gradient methods exist?
I know that policy gradient methods themselves use the policy function for sampling rollouts. But can't we easily have a model for sampling from the environment? If so, I've never seen this done before.

echo
- 673
- 1
- 5
- 12
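For context: off-policy policy gradient methods do exist (off-policy actor-critic methods are the classic example). As a hedged sketch, a commonly used importance-weighted estimator of the policy gradient, ignoring some correction terms, has the form

$$\nabla_\theta J(\theta) \;\approx\; \mathbb{E}_{s \sim d^{b},\, a \sim b(\cdot \mid s)}\left[ \frac{\pi_\theta(a \mid s)}{b(a \mid s)}\, Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right],$$

where $b$ is the behavior policy and $d^{b}$ its state distribution; the ratio reweights samples drawn from $b$ so that, in expectation, they behave like samples from $\pi_\theta$.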
9 votes, 1 answer
Why is the n-step tree backup algorithm an off-policy algorithm?
In the reinforcement learning book by Sutton & Barto (2018 edition), specifically in section 7.5, they present an n-step off-policy algorithm that doesn't require importance sampling, called the n-step tree backup algorithm.
In other…

Brale
- 2,306
- 1
- 5
- 14
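For context on the question above: as far as I recall that chapter, the n-step tree-backup return is built recursively by averaging over the actions not taken, weighted by the target policy's probabilities, which is why no importance sampling ratio appears; roughly,

$$G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n}.$$

The exact indexing may differ slightly from the printed text, but the key point is that actions the behavior policy did not take enter the target only through the target policy's probabilities.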
6 votes, 2 answers
How can the importance sampling ratio be different than zero when the target policy is deterministic?
In the book Reinforcement Learning: An Introduction (2nd edition), Sutton and Barto define, on page 104 (p. 126 of the pdf), in equation (5.3), the importance sampling ratio $\rho_{t:T-1}$ as follows:
$$\rho…

F.M.F.
- 311
- 3
- 7
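For reference, the importance sampling ratio that the excerpt truncates is defined (quoting the book's notation from memory) as

$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},$$

so with a deterministic target policy each factor $\pi(A_k \mid S_k)$ is either 1 or 0, and the ratio is nonzero only on trajectory segments where the behavior policy happened to select exactly the actions the target policy would have selected.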
6 votes, 2 answers
What is the difference between on and off-policy deterministic actor-critic?
In the paper Deterministic Policy Gradient Algorithms, I am really confused about sections 4.1 and 4.2, "On- and Off-Policy Deterministic Actor-Critic".
I don't know what the difference between the two algorithms is.
I only noticed that the…

fish_tree
- 247
- 1
- 6
6 votes, 1 answer
Is Expected SARSA an off-policy or on-policy algorithm?
I understand that SARSA is an on-policy algorithm and Q-learning an off-policy one.
Sutton and Barto's textbook describes Expected Sarsa as follows:
In these cliff walking results Expected Sarsa was used on-policy, but in general it might use a…

Y. Xu
- 63
- 1
- 4
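For context, the Expected Sarsa update referred to in this question is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \right],$$

and whether it is on-policy or off-policy depends on whether the policy $\pi$ inside the expectation is the same policy that generates the actions. If $\pi$ is the greedy policy while behavior is, say, ε-greedy, Expected Sarsa is off-policy (and in that case it coincides with Q-learning).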
5 votes, 1 answer
Why do we need importance sampling?
I was studying the off-policy policy improvement method. Then I encountered importance sampling. I completely understood the mathematics behind the calculation, but I am wondering what a practical example of importance sampling would be.
For instance,…

Alireza Hosseini
- 51
- 2
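As a rough practical illustration of the question above, here is a minimal sketch of ordinary importance sampling for off-policy Monte Carlo evaluation. The function name, the episode format, and the callables pi and b are assumptions of this sketch.

```python
import numpy as np

def is_return_estimate(episodes, pi, b, gamma=0.99):
    """Ordinary importance sampling estimate of the target policy's value,
    using episodes generated by the behavior policy.

    episodes: list of trajectories, each a list of (state, action, reward).
    pi(a, s), b(a, s): action probabilities under the target / behavior policy.
    """
    estimates = []
    for episode in episodes:
        g, rho, discount = 0.0, 1.0, 1.0
        for s, a, r in episode:
            g += discount * r
            discount *= gamma
            rho *= pi(a, s) / b(a, s)   # per-step importance weight
        estimates.append(rho * g)        # reweight the episode's return
    return float(np.mean(estimates))
```

The reweighting corrects for the data having been generated by b rather than pi, which is what lets returns collected under one policy be used to evaluate another.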
5 votes, 1 answer
Why does off-policy learning outperform on-policy learning?
I am self-studying about Reinforcement Learning using different online resources. I now have a basic understanding of how RL works.
I saw this in a book:
Q-learning is an off-policy learner. An off-policy learner learns the value of an optimal…

Exploring
- 223
- 6
- 16
5 votes, 1 answer
Understanding the n-step off-policy SARSA update
In Sutton & Barto's book (2nd ed.), on page 149, there is equation 7.11.
I am having a hard time understanding this equation.
I would have thought that we should be moving $Q$ towards $G$, where $G$ would be corrected by importance sampling, but only…

Antoine Savine
- 153
- 4
5 votes, 1 answer
How do I compute the variance of the return of an evaluation policy using two behaviour policies?
Suppose there is an evaluation policy called $\pi_{e}$ and there are two behavior policies $\pi_{b1}$ and $\pi_{b2}$. I know that it is possible to estimate the return of policy $\pi_{e}$ through behavior policies via importance sampling, which is…

Amin
- 471
- 2
- 11
4 votes, 1 answer
What is the difference between on-policy and off-policy for continuous environments?
I'm trying to understand RL applied to time series (so with an infinite horizon) which have a continuous state space and a discrete action space.
First, some preliminary questions: in this case, what is the optimal policy? Given the infinite horizon…

unter_983
- 331
- 1
- 6
3 votes, 1 answer
Why are Q values updated according to the greedy policy?
Apparently, in the Q-learning algorithm, the Q values are not updated according to the "current policy", but according to a "greedy policy". Why is that the case? I think this is related to the fact that Q-learning is off-policy, but I am also not…

Shifat E Arman
- 83
- 1
- 5
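For reference, the Q-learning update the question is asking about is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right].$$

The max corresponds to evaluating a greedy target policy in the bootstrap target, independently of the (typically ε-greedy) behavior policy that actually selects actions, which is exactly what makes Q-learning off-policy.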
3 votes, 0 answers
How does off-policy Monte Carlo explore and converge?
Premises of the question:
- Behavior policy: ε-greedy (stochastic)
- Target policy: greedy (deterministic)
- Importance sampling included
In off-policy Monte Carlo control, the behavior policy chooses actions to follow, and the target policy learns from…

Jonah Kim
- 31
- 1
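Here is a minimal sketch of the setting described in the question above (ε-greedy behavior policy, greedy target policy, weighted importance sampling), in the spirit of off-policy Monte Carlo control. The generate_episode callable and its return format are assumptions of this sketch.

```python
import numpy as np

def off_policy_mc_control(generate_episode, n_states, n_actions,
                          n_episodes=10_000, gamma=1.0):
    """Off-policy Monte Carlo control with weighted importance sampling.

    generate_episode(Q) must return a list of (s, a, r, b_prob) tuples,
    where b_prob is the behavior policy's probability of the action it took
    (e.g. from an epsilon-greedy policy derived from the current Q)."""
    Q = np.zeros((n_states, n_actions))
    C = np.zeros((n_states, n_actions))          # cumulative IS weights
    for _ in range(n_episodes):
        episode = generate_episode(Q)
        g, w = 0.0, 1.0
        for s, a, r, b_prob in reversed(episode):
            g = gamma * g + r
            C[s, a] += w
            Q[s, a] += (w / C[s, a]) * (g - Q[s, a])
            if a != int(np.argmax(Q[s])):        # greedy target: pi(a|s) = 0
                break                             # ratio is 0 for earlier steps
            w /= b_prob                           # pi(a|s) = 1, so w *= 1/b
    return Q
```

Exploration comes entirely from the soft behavior policy; processing the episode backwards and breaking as soon as the taken action disagrees with the greedy target is what the deterministic target policy implies for the importance weights.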
3 votes, 1 answer
Why can we take the action $a$ from the next state $s'$ in the max part of the Q-learning update rule, if that action doesn't lead to any reward?
I'm using OpenAI's cartpole environment. First of all, is this environment not Markov?
Knowing that, my main question concerns Q-learning and off-policy methods:
For me, there is something weird in updating a Q value based on the max Q for a state…

JeanMi
- 155
- 4
3 votes, 1 answer
When learning off-policy with multi-step returns, why do we use the current behaviour policy in importance sampling?
When learning off-policy with multi-step returns, we want to update the value of $Q(s_1, a_1)$ using rewards from the trajectory $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_n, a_n, r_n, s_{n+1})$. We want to learn the target policy $\pi$ while…

Federico Taschin
- 233
- 1
- 6