I've been trying to understand the paper "Distilling the Knowledge in a Neural Network" by Hinton et al., but I can't fully understand this passage:
When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases [...]
The information part is clear to me, but how does high entropy lead to lower variance in the gradient between training cases?
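To make the question concrete, here is a small numpy sketch I put together myself (it is not from the paper): it compares the per-example gradient of the cross-entropy loss with respect to the student's logits for hard one-hot targets versus temperature-softened teacher targets, and prints the variance of that gradient across training cases. The temperature value and the random logits are just assumptions for illustration, and the overall 1/T scale factor from the paper is omitted.

import numpy as np

rng = np.random.default_rng(0)

def softmax(z, T=1.0):
    # Temperature-scaled softmax with a max-shift for numerical stability.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n_cases, n_classes, T = 1000, 10, 5.0   # T = 5 is an assumed temperature

teacher_logits = rng.normal(size=(n_cases, n_classes))
student_logits = rng.normal(size=(n_cases, n_classes))

# Hard targets: one-hot on the teacher's argmax. Soft targets: high-entropy
# teacher distribution at temperature T.
hard_targets = np.eye(n_classes)[teacher_logits.argmax(axis=1)]
soft_targets = softmax(teacher_logits, T)

# For cross-entropy on softmax outputs, d(loss)/d(logit) = p - target
# (ignoring the 1/T factor for the soft case).
grad_hard = softmax(student_logits) - hard_targets
grad_soft = softmax(student_logits, T) - soft_targets

# Variance of each gradient component across training cases, averaged.
print("hard-target gradient variance:", grad_hard.var(axis=0).mean())
print("soft-target gradient variance:", grad_soft.var(axis=0).mean())

Is the lower number I see for the soft targets the effect the authors are referring to, and if so, what is the intuition connecting it to the entropy of the targets?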