I am currently experimenting with PPO in different environments. I am interested in learning policies that achieve a certain goal while keeping a specific quantity low. Here's an example:
Using PPO on a cart-pole environment to learn a swing-up of the pole while simultaneously keeping the angular velocity of the pole low. The standard approach is to add a penalty on the pole's angular velocity to the reward function.
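To make the setup concrete, the shaped reward I mean looks roughly like the following (the terms and the coefficient value are just illustrative, not my exact implementation):

```python
import numpy as np

# Illustrative shaped reward for a cart-pole swing-up: reward uprightness and
# subtract a penalty proportional to the squared angular velocity of the pole.
# `velocity_penalty` is the coefficient discussed below; its value here is made up.
def shaped_reward(theta, theta_dot, velocity_penalty=0.1):
    upright_bonus = np.cos(theta)  # +1 when the pole is upright, -1 when hanging down
    return upright_bonus - velocity_penalty * theta_dot ** 2
```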
However, I observed that penalizing the velocity from the beginning significantly reduces sample efficiency and hinders learning good policies. For this reason, I tried using only a small penalty on the pole velocity until PPO converges to a decent policy, and then applying a refinement step in which I penalize the velocity much more heavily to get good performance and low velocities (sketched below). This seems to work better. I observed similar behavior on other environments in a similar setting.
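Here is a rough sketch of that two-stage schedule; the switch point and the coefficient values are placeholders, in practice I choose them by watching the return curve until the swing-up is learned reliably:

```python
# Two-stage penalty schedule: a small velocity penalty until PPO has learned the
# swing-up, then a much larger one during the refinement phase.
# The threshold and both weights below are made-up example values.
def velocity_penalty_weight(update_idx, switch_after=500,
                            small_weight=0.01, large_weight=1.0):
    """Penalty coefficient used in the shaped reward at a given PPO update."""
    return small_weight if update_idx < switch_after else large_weight

# In the training loop the per-step reward then becomes something like:
#   r_t = cos(theta_t) - velocity_penalty_weight(update_idx) * theta_dot_t**2
```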
I want to find a (formal) reason for this behavior (why penalizing velocities from the beginning hinders learning). Does anybody have literature tips on stochastic optimization/RL that could be useful? Or resources on the topology of high-dimensional spaces? Or even an idea for an explanation of this behavior?
Thanks in advance for any tips!!