
When using an on-policy method in reinforcement learning, like advantage actor-critic, you shouldn't use old data from an experience buffer, since a new policy requires new data. Does this mean that to apply batching to an on-policy method you have to have multiple parallel environments?

As an extension of this, if only one environment is available when using on-policy methods, does that mean batching isn't possible? Doesn't that limit the power of such algorithms in certain cases?

Daniel
  • [Here](https://ai.stackexchange.com/q/21109/2444), [here](https://ai.stackexchange.com/q/20189/2444) and [here](https://ai.stackexchange.com/q/20871/2444) are 3 related questions. – nbro Oct 12 '20 at 12:56

1 Answer


We don't need multiple environments. On-policy algorithms require that new training samples are collected with the newest policy, so we can't reuse old data from an experience buffer. However, we can use the newest policy to collect many samples, even across several episodes, before updating the weights. That update can then be a batch update.
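For illustration, here is a minimal sketch of that pattern with a single environment: collect a fixed number of steps with the current policy, do one batched policy-gradient update, then discard the data. It assumes PyTorch and Gymnasium's `CartPole-v1`; the network size, rollout length, and hyperparameters are arbitrary choices, and a REINFORCE-style loss is used to keep it short (an actor-critic would add a value head and advantages).

```python
# Sketch: batched on-policy updates from ONE environment (assumed setup:
# Gymnasium CartPole-v1, PyTorch; hyperparameters are illustrative only).
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

BATCH_STEPS = 2048  # steps collected with the *current* policy per update
GAMMA = 0.99

for update in range(50):
    obs_buf, act_buf, rew_buf, done_buf = [], [], [], []
    obs, _ = env.reset()

    # 1) Rollout phase: every sample comes from the newest policy (on-policy).
    while len(obs_buf) < BATCH_STEPS:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = torch.distributions.Categorical(logits=logits).sample().item()
        next_obs, reward, terminated, truncated, _ = env.step(action)
        obs_buf.append(obs); act_buf.append(action)
        rew_buf.append(reward); done_buf.append(terminated or truncated)
        obs = next_obs
        if terminated or truncated:
            obs, _ = env.reset()

    # 2) Discounted returns, reset at episode boundaries.
    returns, running = [], 0.0
    for r, d in zip(reversed(rew_buf), reversed(done_buf)):
        running = r + GAMMA * running * (1.0 - float(d))
        returns.append(running)
    returns.reverse()

    # 3) One batch update over the whole rollout; afterwards the data is discarded.
    obs_t = torch.as_tensor(np.array(obs_buf), dtype=torch.float32)
    act_t = torch.as_tensor(act_buf)
    ret_t = torch.as_tensor(returns, dtype=torch.float32)
    ret_t = (ret_t - ret_t.mean()) / (ret_t.std() + 1e-8)  # variance reduction

    log_probs = torch.distributions.Categorical(logits=policy(obs_t)).log_prob(act_t)
    loss = -(log_probs * ret_t).mean()  # REINFORCE-style policy-gradient loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The important part is step 3: the rollout is used for a single batched gradient step and then thrown away, so the training data always comes from the current policy. Parallel environments only make collection faster and less correlated; they are not a requirement for batching.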

Tom Dörr