Questions tagged [on-policy-methods]

For questions related to "on-policy" reinforcement learning algorithms.

On-policy RL algorithms use their current approximation of the policy they attempt to estimate in order to interact with the environment (to gain experience and further update their approximation). An example of an on-policy algorithm is SARSA.
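For concreteness, here is a minimal sketch of tabular SARSA, the canonical on-policy algorithm mentioned above. The environment interface (env.reset() returning a state, env.step(action) returning (next_state, reward, done)) and the hyperparameters are illustrative assumptions, not taken from any of the questions listed below.

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, state, n_actions, epsilon):
        # With probability epsilon pick a random action, otherwise act greedily w.r.t. Q.
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    def sarsa(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Q is a table of action values, defaulting to 0 for unseen (state, action) pairs.
        Q = defaultdict(float)
        for _ in range(episodes):
            state = env.reset()          # hypothetical env interface
            action = epsilon_greedy(Q, state, n_actions, epsilon)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                # On-policy: the next action is selected by the SAME (epsilon-greedy)
                # policy that is being evaluated and improved, and it is the one
                # actually executed in the next step.
                next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
                td_target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
                Q[(state, action)] += alpha * (td_target - Q[(state, action)])
                state, action = next_state, next_action
        return Q

An off-policy method such as Q-learning would instead bootstrap from max_a Q(s', a), regardless of which action the behaviour policy actually takes next.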

33 questions
14 votes · 1 answer

What is the relation between online (or offline) learning and on-policy (or off-policy) algorithms?

In the context of RL, there is the notion of on-policy and off-policy algorithms. I understand the difference between on-policy and off-policy algorithms. Moreover, in RL, there's also the notion of online and offline learning. What is the relation…
6 votes · 2 answers

What is the difference between on- and off-policy deterministic actor-critic?

In the paper Deterministic Policy Gradient Algorithms, I am really confused about sections 4.1 and 4.2, "On- and off-policy Deterministic Actor-Critic". I don't know what the difference between the two algorithms is. I only noticed that the…
6 votes · 1 answer

If $\gamma \in (0,1)$, what is the on-policy state distribution for episodic tasks?

In Reinforcement Learning: An Introduction, section 9.2 (page 199), Sutton and Barto describe the on-policy distribution in episodic tasks, with $\gamma =1$, as being \begin{equation} \mu(s) = \frac{\eta(s)}{\sum_{k \in S}…
6 votes · 1 answer

Is Expected SARSA an off-policy or on-policy algorithm?

I understand that SARSA is an on-policy algorithm and Q-learning an off-policy one. Sutton and Barto's textbook describes Expected Sarsa as follows: In these cliff walking results Expected Sarsa was used on-policy, but in general it might use a…
6 votes · 1 answer

Convergence of semi-gradient TD(0) with non-linear function approximation

I am looking for a result that shows the convergence of the semi-gradient TD(0) algorithm with non-linear function approximation for on-policy prediction. Specifically, the update equation is given by (borrowing notation from Sutton and Barto…
5 votes · 1 answer

Why does off-policy learning outperform on-policy learning?

I am self-studying Reinforcement Learning using different online resources. I now have a basic understanding of how RL works. I saw this in a book: Q-learning is an off-policy learner. An off-policy learner learns the value of an optimal…
4 votes · 1 answer

Why is GLIE Monte-Carlo control an on-policy control?

In slide 16 of lecture 5 of his course "Reinforcement Learning", David Silver introduces GLIE Monte-Carlo Control. But why is it on-policy control? The sampling follows a policy $\pi$ while the improvement follows an $\epsilon$-greedy policy, so…
4 votes · 1 answer

How should I generate datasets for a SARSA agent when the environment is not simple?

I am currently working on my master's thesis and am going to apply Deep-SARSA as my DRL algorithm. The problem is that there are no datasets available, and I guess that I should generate them somehow. Dataset generation seems to be a common feature in this…
4 votes · 1 answer

What does the term $|\mathcal{A}(s)|$ mean in the $\epsilon$-greedy policy?

I've been looking online for a while for a source that explains these computations, but I can't find anywhere what $|\mathcal{A}(s)|$ means. I guess $\mathcal{A}$ is the action set, but I'm not sure about that notation: $$\frac{\varepsilon}{|\mathcal{A}(s)|}…
4 votes · 1 answer

What is the difference between on-policy and off-policy for continuous environments?

I'm trying to understand RL applied to time series (so with infinite horizon), which have a continuous state space and a discrete action space. First, some preliminary questions: in this case, what is the optimal policy? Given the infinite horizon…
3 votes · 0 answers

How to fix high variance of the returns on a 2d env?

I'm trying to train an agent on a self-written 2d env, and it just doesn't converge to the solution. It is basically a 2d game where you have to move a small circle around the screen and try to avoid collisions with randomly moving "enemy" circles…
3 votes · 1 answer

Is it possible to apply a particular exploration policy for the on-policy RL agents?

Is it possible to use any particular strategy to explore (e.g. metaheuristics) in on-policy algorithms (e.g. in PPO) or is it only possible to define particular policies to explore in off-policy algorithms (e.g. TD3)?
3 votes · 1 answer

Why can we take the action $a$ from the next state $s'$ in the max part of the Q-learning update rule, if that action doesn't lead to any reward?

I'm using OpenAI's cartpole environment. First of all, is this environment not Markov? Knowing that, my main question concerns Q-learning and off-policy methods: for me, there is something weird about updating a Q value based on the max Q for a state…
2 votes · 1 answer

Why is the actor-critic algorithm limited to using on-policy data?

Why is the actor-critic algorithm limited to using on-policy data? Or can we use the actor-critic algorithm with off-policy data?
2 votes · 1 answer

Do we need the transition probability function when calculating the importance sampling ratio?

I am reading the book titled "Reinforcement Learning: An Introduction" (by Sutton and Barto). I am at chapter 5, which is about Monte Carlo methods, but now I am quite confused. There is one thing I don't particularly understand. Why do we need the…