What strategies are there to reduce the variance of the policy gradient estimator of the REINFORCE algorithm?

I know one possibility is to subtract a baseline, such as a running average of the returns from past mini-batches. Another is to compute the mean and variance of the returns within a mini-batch and standardise them. A third is to use larger batch sizes.
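For concreteness, here is a minimal sketch of what I mean by the first two strategies (PyTorch chosen just for illustration; the function and variable names are my own, not from any library):

```python
import torch

def reinforce_loss(log_probs, returns, baseline, standardise=True):
    """REINFORCE loss for one mini-batch.

    log_probs : 1-D tensor of log pi(a_t | s_t) for the sampled actions
    returns   : 1-D tensor of discounted returns G_t (same length)
    baseline  : scalar, e.g. a running average of returns from past batches
    """
    # Strategy 1: subtract a baseline. This leaves the gradient unbiased
    # (since E[grad log pi] = 0) but can reduce its variance.
    advantages = returns - baseline
    # Strategy 2: standardise the advantages within the mini-batch.
    if standardise:
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # Minimising this loss performs gradient *ascent* on expected return.
    return -(log_probs * advantages).sum()

# Running-average baseline, updated after each mini-batch (illustrative):
# baseline = 0.9 * baseline + 0.1 * returns.mean().item()
```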

What is considered the most effective? What other methods are there?

nbro
  Eligibility traces and actor-critic methods have much lower variance than REINFORCE, and eligibility traces often work best in practice. – mohottnad Dec 12 '22 at 05:41
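For reference, a minimal sketch of the actor-critic idea the comment alludes to: a learned value function V(s) acts as a state-dependent baseline, and the one-step TD error replaces the full Monte-Carlo return. This is an assumed illustration, not the commenter's code; eligibility traces are omitted for brevity.

```python
import torch

def actor_critic_terms(log_prob, value, next_value, reward, gamma=0.99):
    """One-step actor-critic losses (illustrative sketch).

    value, next_value : critic estimates V(s_t) and V(s_{t+1})
    log_prob          : log pi(a_t | s_t) for the sampled action
    """
    # TD error delta = r + gamma * V(s') - V(s) is a lower-variance,
    # state-dependent advantage estimate in place of the return G_t.
    td_error = reward + gamma * next_value.detach() - value.detach()
    actor_loss = -log_prob * td_error
    # Critic regression target: bootstrap from the detached next value.
    critic_loss = (reward + gamma * next_value.detach() - value).pow(2)
    return actor_loss, critic_loss
```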

0 Answers