I'm trying to train some deep RL agents using policy gradient methods such as actor-critic (AC) and PPO. During training, I monitor a large number of metrics.
I understand that the ultimate goal is to maximize the reward or return per episode, but I don't understand what many of the other metrics are used for.
In particular, how should one interpret the mean and standard deviation curves of the policy loss, value, value loss, entropy, and reward/return over time while training?
What does it mean when these values increase or decrease over time? Given these curves, how would one decide how to tune hyperparameters, diagnose where training is succeeding or failing, and the like?
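
For concreteness, here is roughly how I understand these quantities to be computed for one batch in a PPO-style update. This is only a minimal NumPy sketch under my own assumptions: the function name `ppo_metrics`, its argument names, and the clipping value `clip_eps=0.2` are hypothetical, a discrete action space is assumed, and the library I'm using may compute or scale things differently.

```python
import numpy as np

def ppo_metrics(logp_new, logp_old, advantages, values, returns, probs, clip_eps=0.2):
    """Compute the usual logged quantities for one batch (hypothetical helper).

    logp_new, logp_old: log-probabilities of the taken actions, shape (N,)
    advantages, values, returns: shape (N,)
    probs: action probabilities under the current policy, shape (N, num_actions)
    """
    # Probability ratio between the current policy and the one that collected the data.
    ratio = np.exp(logp_new - logp_old)

    # PPO clipped surrogate objective, negated so that it is a loss to minimize.
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -np.mean(np.minimum(unclipped, clipped))

    # Mean squared error between the critic's predictions and the empirical returns.
    value_loss = np.mean((returns - values) ** 2)

    # Mean entropy of the action distribution; the small constant avoids log(0).
    entropy = -np.mean(np.sum(probs * np.log(probs + 1e-8), axis=1))

    return {
        "policy_loss": float(policy_loss),
        "value_loss": float(value_loss),
        "entropy": float(entropy),
        "value_mean": float(np.mean(values)),
        "value_std": float(np.std(values)),
        "return_mean": float(np.mean(returns)),
        "return_std": float(np.std(returns)),
    }
```

If this matches what is typically logged, then my question is how to read the trends in each of these numbers across training iterations.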