Questions tagged [on-policy-methods]

For questions related to "on-policy" reinforcement learning algorithms.

On-policy RL algorithms use their current approximation of the policy they attempt to estimate in order to interact with the environment (to gain experience and further update their approximation). An example of an on-policy algorithm is SARSA.
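For concreteness, here is a minimal sketch of tabular SARSA, the canonical on-policy algorithm mentioned above. The environment interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done)) and the hyperparameters are illustrative assumptions, not taken from any of the questions listed below.

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, state, n_actions, epsilon):
        # With probability epsilon pick a random action, otherwise act greedily w.r.t. Q.
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    def sarsa(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Q is a table of action values, defaulting to 0 for unseen (state, action) pairs.
        Q = defaultdict(float)
        for _ in range(episodes):
            state = env.reset()          # hypothetical env interface
            action = epsilon_greedy(Q, state, n_actions, epsilon)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                # On-policy: the next action is selected by the SAME (epsilon-greedy)
                # policy that is being evaluated and improved, and it is the one
                # actually executed in the next step.
                next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
                td_target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
                Q[(state, action)] += alpha * (td_target - Q[(state, action)])
                state, action = next_state, next_action
        return Q

An off-policy method such as Q-learning would instead bootstrap from max_a Q(s', a), regardless of which action the behaviour policy actually takes next.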

33 questions
14 votes · 1 answer

What is the relation between online (or offline) learning and on-policy (or off-policy) algorithms?

In the context of RL, there is the notion of on-policy and off-policy algorithms. I understand the difference between on-policy and off-policy algorithms. Moreover, in RL, there's also the notion of online and offline learning. What is the relation…
6 votes · 2 answers

What is the difference between on- and off-policy deterministic actor-critic?

In the paper Deterministic Policy Gradient Algorithms, I am really confused about sections 4.1 and 4.2, "On- and off-policy Deterministic Actor-Critic". I don't know what the difference between the two algorithms is. I only noticed that the…
6 votes · 1 answer

If $\gamma \in (0,1)$, what is the on-policy state distribution for episodic tasks?

In Reinforcement Learning: An Introduction, section 9.2 (page 199), Sutton and Barto describe the on-policy distribution in episodic tasks, with $\gamma =1$, as being \begin{equation} \mu(s) = \frac{\eta(s)}{\sum_{k \in S}…
6 votes · 1 answer

Is Expected SARSA an off-policy or on-policy algorithm?

I understand that SARSA is an on-policy algorithm and Q-learning an off-policy one. Sutton and Barto's textbook describes Expected Sarsa as follows: In these cliff walking results Expected Sarsa was used on-policy, but in general it might use a…
6 votes · 1 answer

Convergence of semi-gradient TD(0) with non-linear function approximation

I am looking for a result that shows the convergence of the semi-gradient TD(0) algorithm with non-linear function approximation for on-policy prediction. Specifically, the update equation is given by (borrowing notation from Sutton and Barto…
5 votes · 1 answer

Why does off-policy learning outperform on-policy learning?

I am self-studying Reinforcement Learning using different online resources. I now have a basic understanding of how RL works. I saw this in a book: Q-learning is an off-policy learner. An off-policy learner learns the value of an optimal…
4 votes · 1 answer

Why is GLIE Monte-Carlo control an on-policy control?

In slide 16 of lecture 5 of his course "Reinforcement Learning", David Silver introduces GLIE Monte-Carlo Control. But why is it on-policy control? The sampling follows a policy $\pi$ while the improvement follows an $\epsilon$-greedy policy, so…
4 votes · 1 answer

How should I generate datasets for a SARSA agent when the environment is not simple?

I am currently working on my master's thesis and am going to apply Deep-SARSA as my DRL algorithm. The problem is that there are no datasets available, and I guess that I should generate them somehow. Dataset generation seems to be a common feature in this…
4 votes · 1 answer

What does the term $|\mathcal{A}(s)|$ mean in the $\epsilon$-greedy policy?

I've been looking online for a while for a source that explains these computations, but I can't find anywhere what $|\mathcal{A}(s)|$ means. I guess $\mathcal{A}$ is the action set, but I'm not sure about that notation: $$\frac{\varepsilon}{|\mathcal{A}(s)|}…
4 votes · 1 answer

What is the difference between on-policy and off-policy for continuous environments?

I'm trying to understand RL applied to time series (so with infinite horizon), which have a continuous state space and a discrete action space. First, some preliminary questions: in this case, what is the optimal policy? Given the infinite horizon…
3 votes · 0 answers

How to fix high variance of the returns on a 2d env?

I'm trying to train an agent on a self-written 2d env, and it just doesn't converge to the solution. It is basically a 2d game where you have to move a small circle around the screen and try to avoid collisions with randomly moving "enemy" circles…
3 votes · 1 answer

Is it possible to apply a particular exploration policy for the on-policy RL agents?

Is it possible to use any particular strategy to explore (e.g. metaheuristics) in on-policy algorithms (e.g. in PPO) or is it only possible to define particular policies to explore in off-policy algorithms (e.g. TD3)?
3 votes · 1 answer

Why can we take the action $a$ from the next state $s'$ in the max part of the Q-learning update rule, if that action doesn't lead to any reward?

I'm using OpenAI's cartpole environment. First of all, is this environment not Markov? Knowing that, my main question concerns Q-learning and off-policy methods: for me, there is something weird about updating a Q value based on the max Q for a state…
2 votes · 1 answer

Why is the actor-critic algorithm limited to using on-policy data?

Why is the actor-critic algorithm limited to using on-policy data? Or can we use the actor-critic algorithm with off-policy data?
2 votes · 1 answer

Do we need the transition probability function when calculating the importance sampling ratio?

I am reading the book titled "Reinforcement Learning: An Introduction" (by Sutton and Barto). I am at chapter 5, which is about Monte Carlo methods, but now I am quite confused. There is one thing I don't particularly understand. Why do we need the…