I'm working with a PPO agent with a small, discrete action space (3 possible actions, 1 of which is always masked depending on the state).
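To make the setup concrete, here's a minimal sketch of the kind of masking I mean, assuming the common approach of setting the invalid action's logit to a large negative value before the softmax (my exact implementation isn't the point, just that the entropy is computed over the masked distribution):

```python
import numpy as np

def masked_policy(logits, mask):
    """Turn raw logits into action probabilities with invalid actions masked out.

    logits: array of shape (3,), one logit per action
    mask:   boolean array of shape (3,), True where the action is valid
    """
    masked_logits = np.where(mask, logits, -1e9)   # invalid action gets ~zero probability
    z = masked_logits - masked_logits.max()        # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return probs

# Example: action 2 is invalid in this state
print(masked_policy(np.array([2.0, 0.5, 1.0]), np.array([True, True, False])))
```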
Premise 1: My understanding is that the "entropy" of output probabilities is calculated according to the formula given here.
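For reference, I'm assuming the linked formula is the standard Shannon entropy of the categorical action distribution (natural log, as most PPO implementations report it):

$$H\big(\pi(\cdot \mid s)\big) = -\sum_{a} \pi(a \mid s)\,\log \pi(a \mid s)$$

With one of the three actions masked out, only two actions are available, so the maximum possible entropy is $\ln 2 \approx 0.69$.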
Premise 2: My entropy during training drops below 0.05 in under 100k timesteps. Using the above formula, I roughly estimate that this means the agent is assigning >98% probability to its chosen action in nearly all its decisions at that point.
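As a sanity check on that estimate, here's a quick sketch (assuming natural-log entropy over the two unmasked actions):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (natural log) of a categorical distribution, ignoring zero entries."""
    probs = np.asarray(probs, dtype=float)
    nz = probs[probs > 0]
    return -(nz * np.log(nz)).sum()

# With one of the three actions masked, only two actions carry probability mass.
print(entropy([0.98, 0.02]))    # ~0.098
print(entropy([0.99, 0.01]))    # ~0.056
print(entropy([0.995, 0.005]))  # ~0.031
```

An entropy below 0.05 lines up with roughly 99% probability on the chosen action, which is where the >98% figure comes from.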
Question: When the agent is already so confident, is there any point in continuing training? Will it learn anything else, or just keep exploiting the policy it found?
I realize I can run experiments to try to answer this empirically, and I have. It appears that it just stops learning at that point, but I wanted to validate my observations and my premises.