
Do off-policy policy gradient methods exist?

I know that policy gradient methods use the policy function itself to sample rollouts. But couldn't we just as easily use a different model or policy to sample from the environment? If so, I've never seen this done before.

  • An extension to this would be: can we leverage the experience replay concept in policy gradient algorithms? I understand the answer is no because we are not talking about an off-policy update, but are there alternatives or workarounds? – Allohvk Nov 23 '22 at 07:41

1 Answer


Absolutely, it's a really interesting problem. Here is a paper detailing off-policy actor-critic. This is important because the method also supports continuous actions.

The general idea of off-policy algorithms is to compare the probability that the behaviour policy (the policy actually acting in the world) assigns to an action with the probability the target policy (the policy we want to learn) assigns to that same action. From this comparison we get an importance sampling ratio $\rho = \frac{\pi(a \mid s)}{b(a \mid s)} \geq 0$, which scales the update to the target policy. The more likely the target policy is to take the observed action relative to the behaviour policy, the larger $\rho$ and the larger the magnitude of the learning update for that step; if $\rho = 0$ (the target policy would never take that action), the update is ignored.
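To make this concrete, here is a minimal sketch of a REINFORCE-style update weighted by the importance sampling ratio; it is not the full algorithm from the paper above (which also uses a critic and eligibility traces), just an illustration of where $\rho$ enters. The names `policy`, `states`, `actions`, `returns` and `behaviour_probs` are hypothetical, and `policy` is assumed to be a PyTorch module mapping states to action logits.

```python
import torch

def off_policy_pg_update(policy, optimizer, states, actions, returns, behaviour_probs):
    """One importance-sampling-weighted policy-gradient step (sketch).

    states, actions, returns, behaviour_probs: tensors collected while a
    behaviour policy b was acting; behaviour_probs holds b(a|s) at collection time.
    """
    logits = policy(states)                              # assumed: states -> action logits
    dist = torch.distributions.Categorical(logits=logits)
    log_prob = dist.log_prob(actions)                    # log pi(a|s) under the target policy

    # Importance sampling ratio rho = pi(a|s) / b(a|s).
    # Detached so it acts as a fixed weight and only log pi(a|s) is differentiated.
    rho = (log_prob.exp() / behaviour_probs).detach()

    # Scale the usual REINFORCE objective by rho before taking the gradient step.
    loss = -(rho * log_prob * returns).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that the raw ratio can have high variance when the two policies differ a lot, which is why practical off-policy methods typically truncate or otherwise control it rather than using it directly as in this sketch.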
