What strategies are there to reduce the variance of the policy gradient estimator of the REINFORCE algorithm?
I know one possibility is to subtract a baseline, e.g. a running average of returns from past mini-batches. Another is to standardise the returns within each mini-batch, i.e. subtract their mean and divide by their standard deviation. A third one is simply to use larger batch sizes. A small sketch of what I mean by the first two follows below.
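To be concrete, here is a minimal NumPy sketch of the return processing I have in mind; the function names and the `baseline`/`momentum` values are just placeholders I made up for illustration:

```python
import numpy as np

def advantages_with_running_baseline(returns, baseline, momentum=0.9):
    """Variant 1: subtract a baseline kept as a running average of past returns."""
    advantages = returns - baseline
    # Update the running average so later mini-batches use a fresher baseline.
    new_baseline = momentum * baseline + (1.0 - momentum) * returns.mean()
    return advantages, new_baseline

def advantages_with_standardisation(returns, eps=1e-8):
    """Variant 2: standardise the returns within the current mini-batch."""
    return (returns - returns.mean()) / (returns.std() + eps)

# Toy usage: per-trajectory returns of an 8-trajectory mini-batch.
batch_returns = np.array([12.0, 7.5, 9.0, 14.0, 6.0, 11.0, 8.5, 10.0])
adv1, baseline = advantages_with_running_baseline(batch_returns, baseline=9.0)
adv2 = advantages_with_standardisation(batch_returns)
# The REINFORCE gradient estimate then weights each trajectory's
# sum of grad log pi(a_t | s_t) by its advantage instead of its raw return.
```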
What is considered the most effective? What other methods are there?