What strategies are there to reduce the variance of the policy gradient estimator of the REINFORCE algorithm?
I know one possibility is to subtract a baseline, e.g. a running average of returns from past mini-batches. Another is to standardise the returns within each mini-batch, i.e. subtract their mean and divide by their standard deviation. A third one is simply to use larger batch sizes. A small sketch of what I mean by the first two follows below.
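To be concrete, here is a minimal NumPy sketch of the return processing I have in mind; the function names and the `baseline`/`momentum` values are just placeholders I made up for illustration:

```python
import numpy as np

def advantages_with_running_baseline(returns, baseline, momentum=0.9):
    """Variant 1: subtract a baseline kept as a running average of past returns."""
    advantages = returns - baseline
    # Update the running average so later mini-batches use a fresher baseline.
    new_baseline = momentum * baseline + (1.0 - momentum) * returns.mean()
    return advantages, new_baseline

def advantages_with_standardisation(returns, eps=1e-8):
    """Variant 2: standardise the returns within the current mini-batch."""
    return (returns - returns.mean()) / (returns.std() + eps)

# Toy usage: per-trajectory returns of an 8-trajectory mini-batch.
batch_returns = np.array([12.0, 7.5, 9.0, 14.0, 6.0, 11.0, 8.5, 10.0])
adv1, baseline = advantages_with_running_baseline(batch_returns, baseline=9.0)
adv2 = advantages_with_standardisation(batch_returns)
# The REINFORCE gradient estimate then weights each trajectory's
# sum of grad log pi(a_t | s_t) by its advantage instead of its raw return.
```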
What is considered the most effective? What other methods are there?