Just made an interesting observation while playing around with Stable Baselines' implementation of PPO and the BipedalWalker environment from OpenAI Gym, but I believe this should be a general property of deep learning.
With a small batch size of 512 samples, the walker achieves near-optimal behavior after just 0.5 million steps. The optimized hyperparameters in the RL Zoo suggest a batch size of 32k samples. That clearly leads to better performance after 5 million steps, but it takes 2 million steps to reach near-optimal behavior.
Hence the question: shouldn't we schedule the batch size to improve sample efficiency?
I think this makes sense: right after initialization the policy is far from the optimal one, so it should be updated quickly to improve. Even though gradient estimates from small batches are very noisy, they still seem to bring the policy to a fairly good state quickly. After that, we can increase the batch size and take fewer but more precise gradient steps. Or am I missing an important point here?
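For concreteness, here is a minimal sketch of what I have in mind, using Stable-Baselines3's PPO (the stage lengths and rollout sizes below are just illustrative guesses, not tuned values): train a first stage with small rollouts, then continue with a second model that uses the large rollout size, carrying over the learned weights.

```python
# Sketch of a two-stage batch-size schedule with Stable-Baselines3 PPO.
# The stage lengths and rollout sizes are illustrative, not tuned values.
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("BipedalWalker-v3", n_envs=16)

# Stage 1: small rollouts (16 envs * 32 steps = 512 samples per update)
# -> noisy but frequent policy updates while the policy is still far from optimal.
model = PPO("MlpPolicy", env, n_steps=32, batch_size=512, verbose=1)
model.learn(total_timesteps=500_000)
model.save("ppo_stage1")

# Stage 2: large rollouts (16 envs * 2048 steps = 32k samples per update)
# -> fewer but lower-variance updates once the policy is already decent.
model_large = PPO("MlpPolicy", env, n_steps=2048, batch_size=512, verbose=1)
model_large.set_parameters("ppo_stage1")  # carry over the learned weights
model_large.learn(total_timesteps=4_500_000)
```

Building a fresh model for the second stage sidesteps having to resize PPO's rollout buffer in place, which (as far as I can tell) is why I couldn't just change n_steps on the existing model mid-training.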