I am currently experimenting with PPO in different environments. I am interested in learning policies that achieve a certain goal while keeping a specific quantity low. Here's an example:
Using PPO on a cart-pole environment to learn a swing-up of the pole while simultaneously keeping the angular velocity of the pole low. The standard approach is to add a penalty on the pole's angular velocity to the reward function.
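To make the setup concrete, the shaped reward I mean looks roughly like the following (the terms and the coefficient value are just illustrative, not my exact implementation):

```python
import numpy as np

# Illustrative shaped reward for a cart-pole swing-up: reward uprightness and
# subtract a penalty proportional to the squared angular velocity of the pole.
# `velocity_penalty` is the coefficient discussed below; its value here is made up.
def shaped_reward(theta, theta_dot, velocity_penalty=0.1):
    upright_bonus = np.cos(theta)  # +1 when the pole is upright, -1 when hanging down
    return upright_bonus - velocity_penalty * theta_dot ** 2
```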
However, I observed that penalizing the velocity from the beginning significantly reduces sample efficiency and hinders learning good policies. For this reason, I tried using only a small penalty on the pole velocity until PPO converges to a decent policy, and then applying a refinement step in which I penalize the velocity much more heavily to get good performance and low velocities (sketched below). This seems to work better. I observed similar behavior on other environments in a similar setting.
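Here is a rough sketch of that two-stage schedule; the switch point and the coefficient values are placeholders, in practice I choose them by watching the return curve until the swing-up is learned reliably:

```python
# Two-stage penalty schedule: a small velocity penalty until PPO has learned the
# swing-up, then a much larger one during the refinement phase.
# The threshold and both weights below are made-up example values.
def velocity_penalty_weight(update_idx, switch_after=500,
                            small_weight=0.01, large_weight=1.0):
    """Penalty coefficient used in the shaped reward at a given PPO update."""
    return small_weight if update_idx < switch_after else large_weight

# In the training loop the per-step reward then becomes something like:
#   r_t = cos(theta_t) - velocity_penalty_weight(update_idx) * theta_dot_t**2
```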
I want to find a (formal) reason for this behavior (why penalizing velocities from the beginning hinders learning). Does anybody have literature tips on stochastic optimization/RL that could be useful? Or resources on the topology of high-dimensional spaces? Or even an idea for an explanation of this behavior?
Thanks in advance for any tips!!