
I'm trying to train some deep RL agents using policy gradient methods like actor-critic (AC) and PPO. While training, I monitor a ton of different metrics.

I understand that the ultimate goal is to maximize the reward or return per episode. But there are a ton of other metrics whose purpose I don't understand.

In particular, how should one interpret the mean and standard deviation curves of the policy loss, value, value loss, entropy, and reward/return over time while training?

What does it mean when these values increase or decrease over time? Given these curves, how would one decide how to tune hyperparameters, see where the training is succeeding and failing, and the like?

nbro
  • I think you should focus on one metric and one algorithm. In any case, I think you should provide a definition of "policy loss" and "value loss". And are you calculating the entropy of what exactly? – nbro Jul 07 '20 at 22:49

1 Answer


As you said, generally the most important one is reward per episode. If this isn't increasing overall, there's a problem (of course this metric can fluctuate, I mean to say that macroscopically it should increase).

Policy loss (I assume you mean the "actor loss"?) is generally harder to interpret. You should think of it more as a source of gradients than as a direct indicator of how well your agent is performing.
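
For concreteness, here is a minimal sketch of how an actor loss is often computed in a vanilla actor-critic setup (the tensors and numbers below are purely illustrative, not taken from your setup):

```python
import torch

# Illustrative rollout data (hypothetical values):
# log_probs  - log pi(a_t | s_t) for the actions actually taken
# advantages - advantage estimates (e.g. returns minus a value baseline)
log_probs = torch.tensor([-0.7, -1.2, -0.3])
advantages = torch.tensor([0.5, -0.2, 1.1])

# Vanilla policy-gradient ("actor") loss: minimizing it increases the
# probability of actions with positive advantage. Its raw value mostly
# serves as a source of gradients, not as a performance measure.
policy_loss = -(log_probs * advantages.detach()).mean()
```

Note that PPO replaces this with a clipped surrogate objective, which makes the raw loss value even less interpretable.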

I'm not really sure why you'd monitor the value itself during training. The value loss, however, is basically equivalent to the loss in value-based methods like Q-learning, so it should be decreasing overall. Otherwise the baselines you compute to reduce variance in the policy gradient will be less effective, or even harmful.
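
As a rough sketch (assuming the usual regression formulation; the names and values are illustrative), the value loss is just a regression loss between the critic's predictions and the computed returns:

```python
import torch
import torch.nn.functional as F

# Illustrative data (hypothetical values):
# values  - the critic's predicted state values V(s_t)
# returns - empirical or bootstrapped returns used as regression targets
values = torch.tensor([1.0, 0.4, 2.1])
returns = torch.tensor([1.3, 0.2, 2.5])

# Critic ("value") loss: this should trend downward over training,
# just like the TD error in value-based methods such as Q-learning.
value_loss = F.mse_loss(values, returns)
```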

Entropy is a nice quantity to measure, because it's a good indicator of how much your agent is exploring. If you see that your agent is not achieving high returns and the entropy is really low, this means that your policy has converged to a suboptimal one. If the entropy is really high, this means the agent is acting fairly randomly (so it's basically exploring a lot). Ideally the entropy should decrease over time, so your policy becomes more deterministic (less exploration) as it reaches an optimum.
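
If it helps, the entropy curve is usually just the mean entropy of the action distribution the policy outputs at the visited states; here is a rough sketch for a discrete policy (the logits are made up for illustration):

```python
import torch
from torch.distributions import Categorical

# Hypothetical action logits for a 4-action discrete policy.
logits = torch.tensor([[2.0, 0.1, 0.1, 0.1],   # fairly peaked  -> low entropy
                       [0.5, 0.5, 0.5, 0.5]])  # uniform        -> high entropy

dist = Categorical(logits=logits)

# Mean per-state policy entropy: near 0 means nearly deterministic,
# near log(num_actions) means close to uniform (lots of exploration).
entropy = dist.entropy().mean()
```

Many implementations also add this entropy (scaled by a small coefficient) as a bonus to the objective, precisely to discourage premature convergence to a deterministic policy.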

harwiltz