Progress on how to best initialize the weights is a big part of what made neural networks popular again.
Initially (around the 80s, I think), NNs were initialized by sampling the weights from a Normal distribution like $\mathcal{N}(0, I)$, but that caused unstable training prone to divergence. Also, initializing the weights (excluding the biases) to a constant is something you should NEVER do, because weights with the same value receive exactly the same gradient and are updated in exactly the same way: they stay redundant and do not help the network learn. Therefore, you want to pick random weights, but how?
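To see the symmetry problem concretely, here is a minimal NumPy sketch (a toy two-layer linear net with made-up data and layer sizes, just for illustration): with constant weights, the two hidden units receive identical gradients and stay identical after the update, while random initialization breaks the tie.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # tiny batch: 4 samples, 3 features
y = rng.normal(size=(4, 1))        # regression targets

def one_step(W1, W2, lr=0.1):
    """One gradient step of a 2-layer linear net with MSE loss."""
    h = x @ W1                      # hidden layer (3 -> 2)
    y_hat = h @ W2                  # output layer (2 -> 1)
    err = y_hat - y
    grad_W2 = h.T @ err / len(x)
    grad_W1 = x.T @ (err @ W2.T) / len(x)
    return W1 - lr * grad_W1, W2 - lr * grad_W2

# Constant init: both hidden units start identical...
W1 = np.full((3, 2), 0.5)
W2 = np.full((2, 1), 0.5)
W1, W2 = one_step(W1, W2)
print(W1[:, 0] == W1[:, 1])         # ...and stay identical after the update

# Random init breaks the symmetry.
W1 = rng.normal(scale=0.1, size=(3, 2))
W2 = rng.normal(scale=0.1, size=(2, 1))
W1, W2 = one_step(W1, W2)
print(W1[:, 0] == W1[:, 1])         # columns now differ
```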
> Are there weight initialization methods theoretically proven to be worse than others?
>
> Is there research into what weight initialization methods work best across the board for different kinds of architectures?
Consider that the choice of weight initialization scheme should be made according to the activation function used in your model, and in some cases it is also related to the network architecture. So, weight initialization and activation are closely related.
- This paper introduces what is now called the Glorot (or Xavier) initialization, and also motivates why NNs initialized in the classical (and thus wrong) way fail. In a few words, Glorot initialization is designed to work well with saturating activations like the sigmoid and tanh, preventing them from saturating (which leads to vanishing gradients) early in training; see the sketch after this list.
- This more recent paper, instead, introduces the He initialization strategy, which is designed to work with rectified activations (ReLU, leaky ReLU, etc.), thus reducing the chance of dead units. Indeed, this weight initialization is also employed in the ResNet architectures.
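For reference, here is a minimal NumPy sketch of the two schemes as they are commonly stated (the layer sizes below are arbitrary placeholders): Glorot draws weights with variance $2 / (\text{fan}_{in} + \text{fan}_{out})$, while He uses variance $2 / \text{fan}_{in}$, which compensates for ReLU zeroing out roughly half of its inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    """Glorot/Xavier: Var(W) = 2 / (fan_in + fan_out), uniform variant."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    """He: Var(W) = 2 / fan_in, tuned for ReLU-family activations."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Example layer sizes (arbitrary): 784 inputs -> 256 hidden units.
W_tanh = glorot_uniform(784, 256)   # pair with tanh / sigmoid layers
W_relu = he_normal(784, 256)        # pair with ReLU layers
print(W_tanh.std(), W_relu.std())
```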
That said, this does not necessarily mean that, e.g., using Glorot with ReLU is always bad. But these two papers provide an analysis of how the variance of the activations and gradients propagates through the layers under different initialization strategies, and each scheme is designed to keep that variance under control.
A note about pre-training (also related to transfer learning): strictly speaking, this is not a weight initialization strategy. Indeed, you can pick a popular architecture and download weights pre-trained on ImageNet (for example); that gives you a nice initial point, useful to speed up convergence on downstream tasks, and especially helpful when you have little training data. I consider pre-training to be a second step, because the first time you have to pre-train the model yourself, so starting from a proper weight initialization is beneficial for the pre-training itself.
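For example, here is a minimal Keras sketch of using ImageNet weights as the starting point for a hypothetical downstream classifier (the 10-class head, input shape, and the choice to freeze the backbone are just placeholders for illustration):

```python
import tensorflow as tf

# Backbone pre-trained on ImageNet: its weights serve as the initial point.
backbone = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False,
    pooling="avg", input_shape=(224, 224, 3))
backbone.trainable = False          # optionally freeze it for feature extraction

# Hypothetical downstream task with 10 classes: only the new head starts
# from a fresh random initialization (Glorot uniform by Keras default).
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```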