When I run Soft Actor-Critic (SAC, off-policy) in my environment, the calculation of gradient updates takes almost twice as long as with PPO (on-policy). I also saw that ACER has higher time complexity than PPO in the paper PPO with Prioritized Trajectory Replay.
In their comparison, ACER took almost 5 times longer (wall-clock time) than PPO for 1000 steps on the Atlantis-v0 benchmark.
This made me wonder whether the difference has to do with their different algorithmic approaches (off- vs. on-policy).
This question is not about the difference between sample efficiency and computational or time efficiency, as that was explained quite well here.
The question is: Is my assumption correct that off-policy algorithms are more computationally complex or time-consuming than on-policy ones? If so, why? Is there a particular component, such as the replay buffer, that increases the complexity?
Or is my assumption incorrect, and the differences in computational or time complexity are only implementation-related, meaning that off-policy algorithms don't necessarily have to be more computationally complex or take more time than on-policy ones?
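To make concrete what I mean by the cost of a single gradient update, here is a minimal PyTorch sketch (not my actual setup; the network sizes, the stand-in losses, and names like `sac_update` / `ppo_update` are just placeholders) that compares how many network passes one SAC-style off-policy update needs versus one PPO-style on-policy update:

```python
import time
import torch
import torch.nn as nn

obs_dim, act_dim, batch = 32, 8, 256

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

# SAC-style networks: policy, two critics, two target critics.
policy = mlp(obs_dim, act_dim)
q1, q2 = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)
q1_targ, q2_targ = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)

def sac_update(s, a, s2):
    # Target values: one policy forward on next states + two target-critic forwards.
    a2 = policy(s2)
    sa2 = torch.cat([s2, a2], dim=-1)
    target = torch.min(q1_targ(sa2), q2_targ(sa2)).detach()
    # Critic update: two critic forwards + backward.
    sa = torch.cat([s, a], dim=-1)
    critic_loss = ((q1(sa) - target) ** 2).mean() + ((q2(sa) - target) ** 2).mean()
    critic_loss.backward()
    # Actor update: another policy forward + two more critic forwards + backward.
    sa_pi = torch.cat([s, policy(s)], dim=-1)
    actor_loss = -torch.min(q1(sa_pi), q2(sa_pi)).mean()
    actor_loss.backward()

# PPO-style networks: one policy and one value net.
ppo_pi, ppo_v = mlp(obs_dim, act_dim), mlp(obs_dim, 1)

def ppo_update(s, a, adv, ret):
    # One policy forward + one value forward + backward.
    # (Stand-in surrogate losses; the real clipped objective is omitted because
    # only the amount of network compute per update matters here.)
    pi_loss = (((ppo_pi(s) - a) ** 2).sum(-1, keepdim=True) * adv).mean()
    v_loss = ((ppo_v(s) - ret) ** 2).mean()
    (pi_loss + v_loss).backward()

s, a, s2 = torch.randn(batch, obs_dim), torch.randn(batch, act_dim), torch.randn(batch, obs_dim)
adv, ret = torch.randn(batch, 1), torch.randn(batch, 1)

t0 = time.perf_counter(); sac_update(s, a, s2); t_sac = time.perf_counter() - t0
t0 = time.perf_counter(); ppo_update(s, a, adv, ret); t_ppo = time.perf_counter() - t0
print(f"one SAC-style update: {t_sac * 1e3:.2f} ms | one PPO-style update: {t_ppo * 1e3:.2f} ms")
```

In this toy comparison the SAC-style step touches five networks per gradient update while the PPO-style step touches two, and it ignores the replay-buffer sampling itself, which is the other component I suspect. I don't know whether this kind of per-update overhead is inherent to off-policy methods as a class or just a property of these particular algorithms and implementations.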