
I am self-studying Reinforcement Learning using different online resources. I now have a basic understanding of how RL works.

I saw this in a book:

Q-learning is an off-policy learner. An off-policy learner learns the value of an optimal policy independently of the agent’s actions, as long as it explores enough.

An on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps.

However, I don't quite understand the difference. Secondly, I came across the claim that an off-policy learner works better than an on-policy one, and I don't understand why that would be, i.e. why off-policy would be better than on-policy.

Exploring
  • Here's a very related question https://stats.stackexchange.com/q/184657 – mugoh Nov 26 '20 at 04:42
  • You say "Secondly, I came across that off-policy learner works better than the on-policy agent.". Where did you hear/read this? – nbro Nov 26 '20 at 09:49

1 Answer


The question linked in the comments above (https://stats.stackexchange.com/q/184657) contains many answers that describe the difference between on-policy and off-policy learning.

Your book may be referring to how the current (DQN-based) state-of-the-art (SOTA) algorithms, such as Ape-X, R2D2 and Agent57, are technically "off-policy", since they use a (very large!) replay buffer, often filled in a distributed manner. This has a number of benefits, such as reusing experience and not forgetting important experiences.

Another benefit is that you can collect experience in a distributed fashion. Since RL is typically bottlenecked not by the computation needed for training but by the collection of experience, the distributed replay buffer in Ape-X can enable much faster training in terms of wall-clock time, though not in terms of sample complexity.

However, it's important to emphasize that these replay-buffer approaches are almost on-policy, in the sense that the replay buffer is constantly updated with new experiences: the policy that generated the data in the buffer is "not too different" from your current policy (just a few gradient steps away), i.e. the distributions over actions chosen by the two policies remain close. Most importantly, this allows the policy to learn from its own mistakes, if it makes any.
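
To make the "off-policy update, but almost on-policy data" point concrete, here is a minimal sketch of such a loop. It assumes the classic gym step/reset API and a small PyTorch Q-network; it's just the vanilla DQN pattern with placeholder hyperparameters, not anything taken from the Ape-X/R2D2 papers.

```python
# Minimal sketch of a DQN-style loop with a bounded FIFO replay buffer.
# The update is off-policy (it bootstraps from the max over actions), but the
# buffer is constantly refreshed, so the data was generated by a policy only
# a few (thousand) gradient steps behind the current one.
import random
from collections import deque

import gym  # assumed: classic API where step() returns (obs, reward, done, info)
import numpy as np
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

buffer = deque(maxlen=10_000)  # bounded: stale ("more off-policy") data gets evicted
gamma, epsilon, batch_size = 0.99, 0.1, 64

state = env.reset()
for step in range(20_000):
    # Behavior policy: epsilon-greedy w.r.t. the *current* Q-network.
    if random.random() < epsilon:
        action = env.action_space.sample()
    else:
        with torch.no_grad():
            action = q_net(torch.as_tensor(state, dtype=torch.float32)).argmax().item()

    next_state, reward, done, _ = env.step(action)
    buffer.append((state, action, reward, next_state, done))
    state = env.reset() if done else next_state

    if len(buffer) >= batch_size:
        batch = random.sample(buffer, batch_size)  # experience is reused many times
        s, a, r, s2, d = map(np.array, zip(*batch))
        s = torch.as_tensor(s, dtype=torch.float32)
        a = torch.as_tensor(a, dtype=torch.int64)
        r = torch.as_tensor(r, dtype=torch.float32)
        s2 = torch.as_tensor(s2, dtype=torch.float32)
        d = torch.as_tensor(d, dtype=torch.float32)

        # Off-policy (Q-learning) target: max over actions, regardless of which
        # action the behavior policy actually took in the next state.
        with torch.no_grad():
            target = r + gamma * (1 - d) * q_net(s2).max(dim=1).values
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```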

Off-policy learning, in general, can also refer to batch RL (a.k.a. offline RL), where you're given a dataset of experiences collected by another behavior policy, and your goal is to improve over it. Notably, you don't get to roll out your current policy at all! In this case, algorithms that worked well with a replay buffer (like DQN and SAC) fail miserably, since they over-estimate the value of actions when they extrapolate outside the "support" of the dataset. See the BCQ paper, which illustrates how a lot of "off-policy" algorithms like DQN fail when the "distance between the two policies is large". For this task, the SOTA is a form of weighted behavioral cloning called Critic Regularized Regression (CRR).
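
As a rough sketch of why the naive bootstrap breaks offline, and of what "weighted behavioral cloning" looks like, compare the two losses below. The networks, the batch format, the crude value baseline and the temperature beta are placeholders I'm assuming for illustration; this shows the general shape of the idea, not the exact loss from the CRR paper.

```python
# (a) Naive off-policy bootstrap (DQN-style) vs. (b) a CRR-style
# advantage-weighted behavioral-cloning loss for the offline setting.
import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    s, a, r, s2, done = batch
    with torch.no_grad():
        # The max ranges over *all* actions, including ones the dataset never
        # contains for state s2; those Q-values are pure extrapolation and, in
        # offline RL, tend to be over-estimated.
        target = r + gamma * (1 - done) * q_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_sa, target)

def crr_style_loss(policy_net, q_net, batch, beta=1.0):
    s, a = batch[0], batch[1]
    with torch.no_grad():
        q = q_net(s)                                  # Q(s, .)
        v = (q.softmax(dim=1) * q).sum(dim=1)         # crude value baseline (assumption)
        advantage = q.gather(1, a.unsqueeze(1)).squeeze(1) - v
        weight = torch.exp(advantage / beta).clamp(max=20.0)
    # Weighted behavioral cloning: only increase the probability of actions
    # that actually appear in the dataset, favoring the apparently-good ones,
    # so the learned policy never strays outside the dataset's support.
    log_prob = policy_net(s).log_softmax(dim=1).gather(1, a.unsqueeze(1)).squeeze(1)
    return -(weight * log_prob).mean()
```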

It's also worth noting that importance sampling can correct off-policy gradients to be on-policy, but the farther your target policy is from the behavior policy, the larger the variance of the correction. This is especially deadly for long-horizon tasks (this is often called the curse of horizon).
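
A tiny numerical sketch (a made-up two-action behavior/target policy pair, nothing from a real benchmark) makes this concrete: the per-trajectory importance weight is a product of per-step ratios, so its mean stays at 1 (it's unbiased) while its variance grows roughly exponentially with the horizon.

```python
# Toy illustration of the "curse of horizon" for importance sampling.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.5])  # behavior policy (collects the data)
pi = np.array([0.8, 0.2])  # target policy we want to evaluate

for H in [1, 5, 10, 20, 40]:
    n = 100_000
    # Actions drawn from the behavior policy at every step of every trajectory.
    actions = rng.choice(2, size=(n, H), p=mu)
    # Per-step ratios pi(a|s) / mu(a|s), multiplied over the horizon.
    weights = (pi[actions] / mu[actions]).prod(axis=1)
    print(f"H={H:3d}  mean weight={weights.mean():6.3f}  variance={weights.var():14.3f}")
# The mean hovers around 1, but the variance blows up as H grows.
```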

To summarize, using a replay buffer (which technically makes the algorithm off-policy), especially a distributed one, can offer a lot of benefits over purely on-policy algorithms. However, this is a very special class of off-policy algorithms, where the behavior policy stays close to your current policy.

But, in general, off-policy learning is a lot harder than on-policy learning: you'll suffer from extrapolation bias if you use DQN-based approaches, and from exponential variance blow-up if you use importance sampling to correct for it.

kaiwenw
  • thanks for the response. But your answer goes way above my head :-( – Exploring Nov 26 '20 at 06:48
  • which part do you want clarification on? do you understand on-policy vs. off-policy after reading @mugoh's post? – kaiwenw Nov 26 '20 at 06:55
  • Maybe you should cite a paper that supports this "in this case, a lot of the traditional "off-policy" algos like DQN, SAC fail miserably, and the SOTA is a form of weighted behavioral cloning called Critic Regularized Regression (CRR)", for those not familiar with the SOTA. It's also not clear what you mean by the last sentence and in particular "but in general on-policy is much easier than off-policy!" (easier in terms of what?). Also, not sure why you say that DQN-based algorithms are "almost on-policy". That seems like saying that there are almost no off-policy algorithms. – nbro Nov 26 '20 at 10:42
  • I would actually say that off-policy algorithms can become on-policy, i.e. they generalize on-policy ones, but they are not "almost on-policy", if you want to be precise. Well, at least, I don't fully get what you mean by that. – nbro Nov 26 '20 at 10:46
  • This answer contains useful information, but I think it is pitched incorrectly to the level of knowledge displayed in the question. – Neil Slater Nov 26 '20 at 12:20
  • @nbro - being "almost on policy" means that the distribution of action choice between behaviour policy and target policy is close - e.g. a low KL divergence (although that is probably not the right measure for RL policy comparisons). If there was a high divergence in off-policy RL, then the learning rate per time step experienced can slow down dramatically, as the amount of data received that is relevant to the target policy will be low. – Neil Slater Nov 26 '20 at 12:24
  • @NeilSlater Yes, sure you can measure the distance between two probability distributions, so that makes sense to say "almost on-policy" provided that is really true in the cases the OP is mentioning. When you say "high-variance", do you mean variance in the type of experience received? If the behaviour policy is very different from the target policy, I guess that having low variance of the experience received with the behaviour policy may not necessarily be a good thing. In any case, that statement is very compact and it's not fully/precisely clear what you mean by that. – nbro Nov 26 '20 at 12:54
  • @nbro: The variance in estimated returns increases, the more you have to correct for difference between on-policy and off-policy distributions. – Neil Slater Nov 26 '20 at 13:20
  • @nbro you're right that off-policy generalizes on-policy, but as the two policies become more and more different, the MSE in the estimate will become very large – kaiwenw Nov 26 '20 at 19:08
  • @kaiwenw yes, that's what Neil was saying above, and I see your point now. Can you please edit your answer to include these details? I think they are useful, then I could clean up all these comments (that may just be noisy). – nbro Nov 26 '20 at 19:11