When I run Soft Actor-Critic (SAC, off-policy) in my environment, the calculation of gradient updates takes almost twice as long as with PPO (on-policy). I also saw that ACER has higher time complexity than PPO in the paper PPO with Prioritized Trajectory Replay.
In their comparison, ACER took almost 5 times longer (wall-clock time) than PPO for 1000 steps on the Atlantis-v0 benchmark.
This made me wonder whether the difference has to do with their different algorithmic approaches (off- vs. on-policy).
This question is not about the difference between sample efficiency and computational or time efficiency, as that was explained quite well here.
The question is: Is my assumption correct that off-policy algorithms are more computationally complex or time-consuming than on-policy ones? If so, why? Is there a particular component, such as the replay buffer, that increases the complexity?
Or is my assumption incorrect, and the differences in computational or time complexity are only implementation-related, meaning that off-policy algorithms don't necessarily have to be more computationally complex or take more time than on-policy ones?
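To make concrete what I mean by the cost of a single gradient update, here is a minimal PyTorch sketch (not my actual setup; the network sizes, the stand-in losses, and names like `sac_update` / `ppo_update` are just placeholders) that compares how many network passes one SAC-style off-policy update needs versus one PPO-style on-policy update:

```python
import time
import torch
import torch.nn as nn

obs_dim, act_dim, batch = 32, 8, 256

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

# SAC-style networks: policy, two critics, two target critics.
policy = mlp(obs_dim, act_dim)
q1, q2 = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)
q1_targ, q2_targ = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)

def sac_update(s, a, s2):
    # Target values: one policy forward on next states + two target-critic forwards.
    a2 = policy(s2)
    sa2 = torch.cat([s2, a2], dim=-1)
    target = torch.min(q1_targ(sa2), q2_targ(sa2)).detach()
    # Critic update: two critic forwards + backward.
    sa = torch.cat([s, a], dim=-1)
    critic_loss = ((q1(sa) - target) ** 2).mean() + ((q2(sa) - target) ** 2).mean()
    critic_loss.backward()
    # Actor update: another policy forward + two more critic forwards + backward.
    sa_pi = torch.cat([s, policy(s)], dim=-1)
    actor_loss = -torch.min(q1(sa_pi), q2(sa_pi)).mean()
    actor_loss.backward()

# PPO-style networks: one policy and one value net.
ppo_pi, ppo_v = mlp(obs_dim, act_dim), mlp(obs_dim, 1)

def ppo_update(s, a, adv, ret):
    # One policy forward + one value forward + backward.
    # (Stand-in surrogate losses; the real clipped objective is omitted because
    # only the amount of network compute per update matters here.)
    pi_loss = (((ppo_pi(s) - a) ** 2).sum(-1, keepdim=True) * adv).mean()
    v_loss = ((ppo_v(s) - ret) ** 2).mean()
    (pi_loss + v_loss).backward()

s, a, s2 = torch.randn(batch, obs_dim), torch.randn(batch, act_dim), torch.randn(batch, obs_dim)
adv, ret = torch.randn(batch, 1), torch.randn(batch, 1)

t0 = time.perf_counter(); sac_update(s, a, s2); t_sac = time.perf_counter() - t0
t0 = time.perf_counter(); ppo_update(s, a, adv, ret); t_ppo = time.perf_counter() - t0
print(f"one SAC-style update: {t_sac * 1e3:.2f} ms | one PPO-style update: {t_ppo * 1e3:.2f} ms")
```

In this toy comparison the SAC-style step touches five networks per gradient update while the PPO-style step touches two, and it ignores the replay-buffer sampling itself, which is the other component I suspect. I don't know whether this kind of per-update overhead is inherent to off-policy methods as a class or just a property of these particular algorithms and implementations.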