
I'm trying to implement Knowledge Distillation, specifically to reproduce the MNIST example given in the paper (Hinton et al., 2015, "Distilling the Knowledge in a Neural Network"). My PyTorch implementation can be found here.

I would expect it to be fairly self-evident that using this method improves results (as it was in other cases I tried, e.g. dropout for better generalization, BatchNorm for faster training). But I couldn't get the distilled ("taught") student network to outperform a "vanilla" student trained directly on the labels. I played with different architectures (with/without dropout, with/without BatchNorm, different numbers of hidden layers, different hidden layer sizes, etc.) and with different hyperparameters (learning rate, dropout rate, etc.). I also tried using only the soft targets, as well as combining them with the hard targets as described in the paper (see the sketch below).
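For reference, the combined soft/hard loss I have in mind is along the lines of the sketch below. The function name and the values T=4.0 and alpha=0.7 are just illustrative placeholders, not necessarily what my implementation uses:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target term: KL divergence between the temperature-softened
    # student and teacher distributions. Scaling by T^2 keeps the gradient
    # magnitude comparable across temperatures, as noted in the paper.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    # Weighted combination of the two objectives.
    return alpha * soft + (1 - alpha) * hard
```

(Using only the soft targets corresponds to alpha = 1.)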

My teacher network didn't reach the 99.3% reported in the paper (= 67 errors), but it was close (a bit below 99%), so I'm not sure whether it's just a matter of training for longer. But if that's the case, it means the benefit of using this method is extremely small. I did once get better results with the distilled network, but only in the first epoch and at much lower accuracies; when trained for more epochs, the regular student outperformed it.

That being said - maybe I'm doing something wrong?

