
Hi, I basically have PPO as implemented in stable-baselines3, and so far it is not scaling well at all, which is very concerning. The basic parallelism strategy is very similar to OpenAI Five's training strategy: each worker runs PPO with a batch size of 4096, parallel environment execution, a single training step per iteration (to avoid stale gradients), and 4 steps per environment. Each worker then averages its weight updates with the other workers', and the loop repeats. A rough sketch of the loop is below.
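
To make the setup concrete, here is a rough sketch of one worker's loop (simplified, not my actual code). I'm assuming `torch.distributed` for the averaging step here just for illustration, and the 1024 envs per worker is simply 4096 / 4:

```python
# Rough sketch of one worker's loop (simplified, not my actual code).
# Assumes one process per worker and torch.distributed for averaging weights.
import torch
import torch.distributed as dist
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

dist.init_process_group("gloo")  # one process per worker
world_size = dist.get_world_size()

# 1024 parallel envs x 4 steps each = 4096 samples per worker per iteration
env = make_vec_env("BipedalWalkerHardcore-v3", n_envs=1024)
model = PPO("MlpPolicy", env, n_steps=4, batch_size=4096, n_epochs=1)

for iteration in range(1_000):
    # collect one 4096-sample batch and take a single training step
    # (n_epochs=1, batch_size == buffer size, to avoid stale gradients)
    model.learn(total_timesteps=4096, reset_num_timesteps=False)

    # average the updated policy weights across all workers
    with torch.no_grad():
        for p in model.policy.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
            p.data /= world_size
```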

Why would this setup fail to scale across thousands of cores? For example, I've been training it on BipedalWalkerHardcore-v3, and the reward asymptotes at about 0 (up from -90).

P.S. If this is not enough information, I'll add more.

profPlum
