
Why is the actor-critic algorithm limited to using on-policy data? Or can we use the actor-critic algorithm with off-policy data?

nbro
apuffin

1 Answer


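It's because, in the actor-critic algorithm, the objective function is an expectation over trajectories $\tau$ sampled from the current policy $\pi_\theta$, i.e. $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$. The gradient is estimated by averaging over samples from that distribution, so data collected by a different policy would give a biased estimate. So yes, you can use off-policy data, but you then have to resort to importance sampling: each sample is reweighted by the ratio of its probability under the current policy to its probability under the policy that generated it.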
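To make the importance sampling point concrete, here is a minimal sketch on a one-step problem (a bandit with three actions) rather than a full actor-critic. The rewards, the two action distributions, and the function name `estimate_with_behavior_data` are all made up for illustration. It compares a naive average over data from a behavior policy with the importance-weighted average, which should land close to the true expected reward under the target policy.

```python
import random


def estimate_with_behavior_data(rewards, target_probs, behavior_probs,
                                n_samples=100_000, seed=0):
    """Estimate E_{a ~ target}[r(a)] from actions sampled from the behavior policy.

    Returns (naive_average, importance_weighted_average).
    All inputs are hypothetical toy values, one entry per action.
    """
    rng = random.Random(seed)
    actions = list(range(len(rewards)))
    naive_total = 0.0
    weighted_total = 0.0
    for _ in range(n_samples):
        # The data come from the behavior policy, not the target policy.
        a = rng.choices(actions, weights=behavior_probs)[0]
        # Importance ratio: target probability over behavior probability.
        ratio = target_probs[a] / behavior_probs[a]
        naive_total += rewards[a]
        weighted_total += ratio * rewards[a]
    return naive_total / n_samples, weighted_total / n_samples


# Toy setup (assumed values, not taken from any real environment).
rewards = [1.0, 0.0, 5.0]
target_probs = [0.7, 0.2, 0.1]    # policy we want to evaluate or improve
behavior_probs = [0.2, 0.3, 0.5]  # policy that actually collected the data

# Exact expected reward under the target policy, for comparison.
true_value = sum(p * r for p, r in zip(target_probs, rewards))

naive, corrected = estimate_with_behavior_data(rewards, target_probs, behavior_probs)
print(f"true value under target policy: {true_value:.3f}")
print(f"naive average of off-policy data: {naive:.3f}")
print(f"importance-weighted average: {corrected:.3f}")
```

The naive average estimates the expected reward under the behavior policy, which is the wrong quantity. In the sequential case the ratio becomes a product of per-step ratios along the trajectory, which is why the variance of this correction can become large.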

nbro
apuffin