
Do off-policy policy gradient methods exist?

I know that policy gradient methods use the policy function itself to sample rollouts. But couldn't we just as easily use a different model or policy to sample from the environment? If so, I've never seen this done before.

  • An extension to this would be: can we leverage the experience replay concept in policy gradient algorithms? I understand the answer is no because we are not talking about an off-policy update, but are there alternatives or workarounds? – Allohvk Nov 23 '22 at 07:41

1 Answer


Absolutely, it's a really interesting problem. Here is a paper detailing off-policy actor-critic. This is important because the method also supports continuous actions.

The general idea of off-policy algorithms is to compare the probability that the behaviour policy (the policy actually acting in the world) assigns to an action with the probability the target policy (the policy we want to learn) assigns to that same action. From this comparison we get an importance sampling ratio $\rho = \frac{\pi(a \mid s)}{b(a \mid s)} \geq 0$, which scales the update to the target policy. The more likely the target policy is to take the observed action relative to the behaviour policy, the larger $\rho$ and the larger the magnitude of the learning update for that step; if $\rho = 0$ (the target policy would never take that action), the update is ignored.
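To make this concrete, here is a minimal sketch of a REINFORCE-style update weighted by the importance sampling ratio; it is not the full algorithm from the paper above (which also uses a critic and eligibility traces), just an illustration of where $\rho$ enters. The names `policy`, `states`, `actions`, `returns` and `behaviour_probs` are hypothetical, and `policy` is assumed to be a PyTorch module mapping states to action logits.

```python
import torch

def off_policy_pg_update(policy, optimizer, states, actions, returns, behaviour_probs):
    """One importance-sampling-weighted policy-gradient step (sketch).

    states, actions, returns, behaviour_probs: tensors collected while a
    behaviour policy b was acting; behaviour_probs holds b(a|s) at collection time.
    """
    logits = policy(states)                              # assumed: states -> action logits
    dist = torch.distributions.Categorical(logits=logits)
    log_prob = dist.log_prob(actions)                    # log pi(a|s) under the target policy

    # Importance sampling ratio rho = pi(a|s) / b(a|s).
    # Detached so it acts as a fixed weight and only log pi(a|s) is differentiated.
    rho = (log_prob.exp() / behaviour_probs).detach()

    # Scale the usual REINFORCE objective by rho before taking the gradient step.
    loss = -(rho * log_prob * returns).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that the raw ratio can have high variance when the two policies differ a lot, which is why practical off-policy methods typically truncate or otherwise control it rather than using it directly as in this sketch.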
