
I'm trying to solve a reinforcement learning problem using a Monte Carlo policy gradient algorithm and, more specifically, REINFORCE, with rewards attributed to individual moves instead of applied to all steps in a rollout.

For this, I do $M$ rollouts, each with $N$ steps, and record the rewards. Let's say I fill an $M \times N$ matrix with the rewards. Sometimes just using these rewards as-is will work, but sometimes the rewards are always positive (or always negative), or the magnitudes cover a large range.

A simple fix is to subtract the overall mean and divide by the overall standard deviation.
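
Concretely, something like this minimal NumPy sketch, where `rewards` is the $M \times N$ matrix described above (the function name and the epsilon are just my own illustration):

```python
import numpy as np

def normalize_global(rewards, eps=1e-8):
    """Subtract the overall mean and divide by the overall standard deviation.

    rewards: array of shape (M, N) -- M rollouts, N steps each.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```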

In my particular case, though, the beginning of a rollout is easier, so during bootstrapping the rewards will be higher there. A typical case has high rewards at the beginning that taper to zero before the end of the rollout. So it seems to make sense to subtract the mean along the trial ($M$) dimension, i.e., a separate mean for each time step, computed across rollouts. Likewise, it might make sense to normalize by the standard deviation along that dimension as well.
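
What I have in mind is roughly the following sketch (same assumed `(M, N)` layout as above, with rows as rollouts and columns as time steps):

```python
import numpy as np

def normalize_per_step(rewards, eps=1e-8):
    """Normalize each time step using the mean/std across the M rollouts.

    rewards: array of shape (M, N) -- M rollouts, N steps each.
    """
    mean = rewards.mean(axis=0, keepdims=True)  # shape (1, N): per-step baseline
    std = rewards.std(axis=0, keepdims=True)    # shape (1, N): per-step spread
    return (rewards - mean) / (std + eps)
```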

My question: Have others already figured this out and developed best practices for normalizing rewards?

I may have added my own twist, but I'm training in batches and using the statistics of the batch (multiple rollouts) to do the normalization. Subtracting the mean is called a baseline in some papers, I think. This article discusses it a bit: https://medium.com/@fork.tree.ai/understanding-baseline-techniques-for-reinforce-53a1e2279b57
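
For completeness, this is roughly how I use the batch-normalized rewards in the update; it is only a sketch of my setup (PyTorch, with tensor names of my own choosing), where each move's log-probability is weighted by its own normalized reward rather than by the whole-rollout return:

```python
import torch

def reinforce_loss(log_probs, normalized_rewards):
    """Per-step REINFORCE objective over a batch of rollouts.

    log_probs, normalized_rewards: tensors of shape (M, N).
    """
    # Negative sign because the optimizer minimizes; the reward weights are
    # treated as constants, so they are detached from the graph.
    return -(log_probs * normalized_rewards.detach()).mean()
```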

Mastiff
  • Could you please provide the name of or link to the reference that describes the specific "MC PG" algorithm you're using? – nbro Jul 16 '21 at 12:30
  • It's REINFORCE, with rewards attributed to individual moves instead of applied to all steps in a rollout. I may have added my own twist, but I'm training in batches and using the stats of the batch (multiple rollouts) to do normalization. The subtracting the mean part is called a baseline in some papers I think. This article discusses it a bit: https://medium.com/@fork.tree.ai/understanding-baseline-techniques-for-reinforce-53a1e2279b57 – Mastiff Jul 16 '21 at 15:27
  • Thanks for clarifying this. I've added the info in your comment to your own post. Feel free to edit your post again if you think it can still be improved. Later, I will delete these comments. – nbro Jul 16 '21 at 17:57

0 Answers