
I've been trying to understand the paper "Distilling the Knowledge in a Neural Network" by Hinton et al., but I cannot fully understand this passage:

When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases [...]

The information part is very clear, but how does high entropy lead to less variance in the gradient between training cases?


1 Answer


Since the teacher is an already trained network, when you run a training example through it, the soft targets it produces are consistent, so the resulting gradient does not have a very high variance.

The gradient varies a lot when you are training a network from scratch, but it stops varying much once the network has learned the underlying pattern.
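
To make the claim from the quote concrete, here is a minimal NumPy sketch (not from the paper; the teacher and student logits, the temperature, and the class count are arbitrary placeholders). It compares the per-example gradient of the cross-entropy with respect to the student's logits, which is (softmax(z/T) - target) / T, under hard one-hot targets versus high-temperature soft targets:

```python
# Illustrative sketch: compare the spread across training cases of the
# per-example gradient of cross-entropy w.r.t. the student's logits,
# which is (softmax(z/T) - target) / T, for hard vs. soft targets.
# All logits here are random placeholders, not real model outputs.
import numpy as np

rng = np.random.default_rng(0)
num_examples, num_classes, T = 1000, 10, 5.0

def softmax(z, temperature=1.0):
    z = z / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

teacher_logits = rng.normal(scale=3.0, size=(num_examples, num_classes))
student_logits = rng.normal(scale=3.0, size=(num_examples, num_classes))

# Hard targets: one-hot on the teacher's most likely class (temperature 1).
hard_targets = np.eye(num_classes)[teacher_logits.argmax(axis=1)]
grad_hard = softmax(student_logits) - hard_targets

# Soft targets: the teacher's high-temperature softmax (high entropy),
# with the extra 1/T factor from differentiating through z/T.
soft_targets = softmax(teacher_logits, temperature=T)
grad_soft = (softmax(student_logits, temperature=T) - soft_targets) / T

print("variance of hard-target gradients across cases:", grad_hard.var(axis=0).mean())
print("variance of soft-target gradients across cases:", grad_soft.var(axis=0).mean())
```

On this toy setup the soft-target gradients vary much less from one training case to the next: the high-entropy targets are all close to uniform (and the 1/T factor shrinks them further), whereas a one-hot target pulls the gradient strongly toward a different class on each example.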
