For questions about off-policy reinforcement learning algorithms, which estimate one policy (the target policy) while using another policy (the behavior policy) to generate experience during learning; this separation helps ensure that all states are sufficiently explored. An example of an off-policy algorithm is Q-learning.
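As a quick illustration of the tag description, here is a minimal tabular sketch of why Q-learning counts as off-policy: actions are selected with an ε-greedy behavior policy, but the bootstrap target evaluates the greedy target policy. The function names, the layout of the Q array, and the hyperparameters are assumptions of this sketch, not taken from any question below.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update. Off-policy: the target uses the
    greedy (target) policy via the max over next actions, regardless of
    which action the behavior policy will actually take next."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(Q, s, eps=0.1):
    """Behavior policy: explores with probability eps, otherwise greedy."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))   # random exploratory action
    return int(Q[s].argmax())                  # greedy action
```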
Questions tagged [off-policy-methods]
70 questions
14 votes, 1 answer
What is the relation between online (or offline) learning and on-policy (or off-policy) algorithms?
In the context of RL, there is the notion of on-policy and off-policy algorithms. I understand the difference between on-policy and off-policy algorithms. Moreover, in RL, there's also the notion of online and offline learning.
What is the relation…

nbro
- 39,006
- 12
- 98
- 176
11 votes, 1 answer
Do off-policy policy gradient methods exist?
Do off-policy policy gradient methods exist?
I know that policy gradient methods themselves use the policy function for sampling rollouts. But can't we easily have a model for sampling from the environment? If so, I've never seen this done before.

echo
- 673
- 1
- 5
- 12
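For context: off-policy policy gradient methods do exist (off-policy actor-critic methods are the classic example). As a hedged sketch, a commonly used importance-weighted estimator of the policy gradient, ignoring some correction terms, has the form

$$\nabla_\theta J(\theta) \;\approx\; \mathbb{E}_{s \sim d^{b},\, a \sim b(\cdot \mid s)}\left[ \frac{\pi_\theta(a \mid s)}{b(a \mid s)}\, Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right],$$

where $b$ is the behavior policy and $d^{b}$ its state distribution; the ratio reweights samples drawn from $b$ so that, in expectation, they behave like samples from $\pi_\theta$.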
9 votes, 1 answer
Why is the n-step tree backup algorithm an off-policy algorithm?
In the reinforcement learning book by Sutton & Barto (2018 edition), specifically in section 7.5, they present an n-step off-policy algorithm that doesn't require importance sampling, called the n-step tree backup algorithm.
In other…

Brale
- 2,306
- 1
- 5
- 14
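For context on the question above: as far as I recall that chapter, the n-step tree-backup return is built recursively by averaging over the actions not taken, weighted by the target policy's probabilities, which is why no importance sampling ratio appears; roughly,

$$G_{t:t+n} \doteq R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n}.$$

The exact indexing may differ slightly from the printed text, but the key point is that actions the behavior policy did not take enter the target only through the target policy's probabilities.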
6 votes, 2 answers
How can the importance sampling ratio be different than zero when the target policy is deterministic?
In the book Reinforcement Learning: An Introduction (2nd edition), Sutton and Barto define, on page 104 (p. 126 of the pdf), in equation (5.3), the importance sampling ratio $\rho_{t:T-1}$ as follows:
$$\rho…

F.M.F.
- 311
- 3
- 7
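For reference, the importance sampling ratio that the excerpt truncates is defined (quoting the book's notation from memory) as

$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},$$

so with a deterministic target policy each factor $\pi(A_k \mid S_k)$ is either 1 or 0, and the ratio is nonzero only on trajectory segments where the behavior policy happened to select exactly the actions the target policy would have selected.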
6 votes, 2 answers
What is the difference between on and off-policy deterministic actor-critic?
In the paper Deterministic Policy Gradient Algorithms, I am really confused about sections 4.1 and 4.2, "On- and Off-Policy Deterministic Actor-Critic".
I don't know what the difference between the two algorithms is.
I only noticed that the…

fish_tree
- 247
- 1
- 6
6 votes, 1 answer
Is Expected SARSA an off-policy or on-policy algorithm?
I understand that SARSA is an on-policy algorithm and Q-learning an off-policy one.
Sutton and Barto's textbook describes Expected Sarsa as follows:
In these cliff walking results Expected Sarsa was used on-policy, but in general it might use a…

Y. Xu
- 63
- 1
- 4
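For context, the Expected Sarsa update referred to in this question is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \right],$$

and whether it is on-policy or off-policy depends on whether the policy $\pi$ inside the expectation is the same policy that generates the actions. If $\pi$ is the greedy policy while behavior is, say, ε-greedy, Expected Sarsa is off-policy (and in that case it coincides with Q-learning).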
5 votes, 1 answer
Why do we need importance sampling?
I was studying the off-policy policy improvement method. Then I encountered importance sampling. I completely understood the mathematics behind the calculation, but I am wondering what a practical example of importance sampling would be.
For instance,…

Alireza Hosseini
- 51
- 2
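As a rough practical illustration of the question above, here is a minimal sketch of ordinary importance sampling for off-policy Monte Carlo evaluation. The function name, the episode format, and the callables pi and b are assumptions of this sketch.

```python
import numpy as np

def is_return_estimate(episodes, pi, b, gamma=0.99):
    """Ordinary importance sampling estimate of the target policy's value,
    using episodes generated by the behavior policy.

    episodes: list of trajectories, each a list of (state, action, reward).
    pi(a, s), b(a, s): action probabilities under the target / behavior policy.
    """
    estimates = []
    for episode in episodes:
        g, rho, discount = 0.0, 1.0, 1.0
        for s, a, r in episode:
            g += discount * r
            discount *= gamma
            rho *= pi(a, s) / b(a, s)   # per-step importance weight
        estimates.append(rho * g)        # reweight the episode's return
    return float(np.mean(estimates))
```

The reweighting corrects for the data having been generated by b rather than pi, which is what lets returns collected under one policy be used to evaluate another.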
5 votes, 1 answer
Why does off-policy learning outperform on-policy learning?
I am self-studying about Reinforcement Learning using different online resources. I now have a basic understanding of how RL works.
I saw this in a book:
Q-learning is an off-policy learner. An off-policy learner learns the value of an optimal…

Exploring
- 223
- 6
- 16
5 votes, 1 answer
Understanding the n-step off-policy SARSA update
In Sutton & Barto's book (2nd ed.), on page 149, there is equation 7.11.
I am having a hard time understanding this equation.
I would have thought that we should be moving $Q$ towards $G$, where $G$ would be corrected by importance sampling, but only…

Antoine Savine
- 153
- 4
5 votes, 1 answer
How do I compute the variance of the return of an evaluation policy using two behaviour policies?
Suppose there is an evaluation policy called $\pi_{e}$ and there are two behavior policies $\pi_{b1}$ and $\pi_{b2}$. I know that it is possible to estimate the return of policy $\pi_{e}$ through behavior policies via importance sampling, which is…

Amin
- 471
- 2
- 11
4 votes, 1 answer
What is the difference between on-policy and off-policy for continuous environments?
I'm trying to understand RL applied to time series (so with an infinite horizon) which have a continuous state space and a discrete action space.
First, some preliminary questions: in this case, what is the optimal policy? Given the infinite horizon…

unter_983
- 331
- 1
- 6
3 votes, 1 answer
Why are Q values updated according to the greedy policy?
Apparently, in the Q-learning algorithm, the Q values are not updated according to the "current policy", but according to a "greedy policy". Why is that the case? I think this is related to the fact that Q-learning is off-policy, but I am also not…

Shifat E Arman
- 83
- 1
- 5
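For reference, the Q-learning update the question is asking about is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right].$$

The max corresponds to evaluating a greedy target policy in the bootstrap target, independently of the (typically ε-greedy) behavior policy that actually selects actions, which is exactly what makes Q-learning off-policy.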
3 votes, 0 answers
How does off-policy Monte Carlo explore and converge?
Premises of the question:
- Behavior policy: ε-greedy (stochastic)
- Target policy: greedy (deterministic)
- Importance sampling included
In off-policy Monte Carlo control, the behavior policy chooses actions to follow, and the target policy learns from…

Jonah Kim
- 31
- 1
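Here is a minimal sketch of the setting described in the question above (ε-greedy behavior policy, greedy target policy, weighted importance sampling), in the spirit of off-policy Monte Carlo control. The generate_episode callable and its return format are assumptions of this sketch.

```python
import numpy as np

def off_policy_mc_control(generate_episode, n_states, n_actions,
                          n_episodes=10_000, gamma=1.0):
    """Off-policy Monte Carlo control with weighted importance sampling.

    generate_episode(Q) must return a list of (s, a, r, b_prob) tuples,
    where b_prob is the behavior policy's probability of the action it took
    (e.g. from an epsilon-greedy policy derived from the current Q)."""
    Q = np.zeros((n_states, n_actions))
    C = np.zeros((n_states, n_actions))          # cumulative IS weights
    for _ in range(n_episodes):
        episode = generate_episode(Q)
        g, w = 0.0, 1.0
        for s, a, r, b_prob in reversed(episode):
            g = gamma * g + r
            C[s, a] += w
            Q[s, a] += (w / C[s, a]) * (g - Q[s, a])
            if a != int(np.argmax(Q[s])):        # greedy target: pi(a|s) = 0
                break                             # ratio is 0 for earlier steps
            w /= b_prob                           # pi(a|s) = 1, so w *= 1/b
    return Q
```

Exploration comes entirely from the soft behavior policy; processing the episode backwards and breaking as soon as the taken action disagrees with the greedy target is what the deterministic target policy implies for the importance weights.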
3 votes, 1 answer
Why can we take the action $a$ from the next state $s'$ in the max part of the Q-learning update rule, if that action doesn't lead to any reward?
I'm using OpenAI's cartpole environment. First of all, is this environment not Markov?
Knowing that, my main question concerns Q-learning and off-policy methods:
For me, there is something weird in updating a Q value based on the max Q for a state…

JeanMi
- 155
- 4
3 votes, 1 answer
When learning off-policy with multi-step returns, why do we use the current behaviour policy in importance sampling?
When learning off-policy with multi-step returns, we want to update the value of $Q(s_1, a_1)$ using rewards from the trajectory $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_n, a_n, r_n, s_{n+1})$. We want to learn the target policy $\pi$ while…

Federico Taschin
- 233
- 1
- 6