I've been trying to understand the paper "Distilling the Knowledge in a Neural Network" by Hinton et al., but I can't fully understand this passage:
When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases [...]
The information part is clear to me, but how does high entropy lead to lower variance in the gradient between training cases?
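To make the question concrete, here is a small numpy sketch I put together myself (it is not from the paper): it compares the per-example gradient of the cross-entropy loss with respect to the student's logits for hard one-hot targets versus temperature-softened teacher targets, and prints the variance of that gradient across training cases. The temperature value and the random logits are just assumptions for illustration, and the overall 1/T scale factor from the paper is omitted.

import numpy as np

rng = np.random.default_rng(0)

def softmax(z, T=1.0):
    # Temperature-scaled softmax with a max-shift for numerical stability.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n_cases, n_classes, T = 1000, 10, 5.0   # T = 5 is an assumed temperature

teacher_logits = rng.normal(size=(n_cases, n_classes))
student_logits = rng.normal(size=(n_cases, n_classes))

# Hard targets: one-hot on the teacher's argmax. Soft targets: high-entropy
# teacher distribution at temperature T.
hard_targets = np.eye(n_classes)[teacher_logits.argmax(axis=1)]
soft_targets = softmax(teacher_logits, T)

# For cross-entropy on softmax outputs, d(loss)/d(logit) = p - target
# (ignoring the 1/T factor for the soft case).
grad_hard = softmax(student_logits) - hard_targets
grad_soft = softmax(student_logits, T) - soft_targets

# Variance of each gradient component across training cases, averaged.
print("hard-target gradient variance:", grad_hard.var(axis=0).mean())
print("soft-target gradient variance:", grad_soft.var(axis=0).mean())

Is the lower number I see for the soft targets the effect the authors are referring to, and if so, what is the intuition connecting it to the entropy of the targets?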