Progress on how to best initialize the weights is a big part of what made neural networks popular again.
Initially (around the 80s, I think), NNs were initialized by sampling the weights from a Normal distribution like $\mathcal{N}(0, I)$, but that caused unstable training prone to divergence. Also, initializing the weights (excluding the biases) to a constant is something you should NEVER do, because weights with the same value receive exactly the same gradient and are updated in exactly the same way: they stay redundant and do not help the network learn. Therefore, you want to pick random weights, but how?
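To see the symmetry problem concretely, here is a minimal NumPy sketch (a toy two-layer linear net with made-up data and layer sizes, just for illustration): with constant weights, the two hidden units receive identical gradients and stay identical after the update, while random initialization breaks the tie.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # tiny batch: 4 samples, 3 features
y = rng.normal(size=(4, 1))        # regression targets

def one_step(W1, W2, lr=0.1):
    """One gradient step of a 2-layer linear net with MSE loss."""
    h = x @ W1                      # hidden layer (3 -> 2)
    y_hat = h @ W2                  # output layer (2 -> 1)
    err = y_hat - y
    grad_W2 = h.T @ err / len(x)
    grad_W1 = x.T @ (err @ W2.T) / len(x)
    return W1 - lr * grad_W1, W2 - lr * grad_W2

# Constant init: both hidden units start identical...
W1 = np.full((3, 2), 0.5)
W2 = np.full((2, 1), 0.5)
W1, W2 = one_step(W1, W2)
print(W1[:, 0] == W1[:, 1])         # ...and stay identical after the update

# Random init breaks the symmetry.
W1 = rng.normal(scale=0.1, size=(3, 2))
W2 = rng.normal(scale=0.1, size=(2, 1))
W1, W2 = one_step(W1, W2)
print(W1[:, 0] == W1[:, 1])         # columns now differ
```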
> Are there weight initialization methods theoretically proven to be worse than others?
>
> Is there research into what weight initialization methods work best across the board for different kinds of architectures?
Consider that the choice of weight initialization scheme should be made according to the activation function used in your model, and in some cases it is also related to the network architecture. So, weight initialization and activation are closely related.
- This paper introduces what is now called the Glorot (or Xavier) initialization, and also motivates why NNs initialized in the classical (and thus wrong) way fail. In a few words, Glorot initialization is designed to work well with saturating activations like the sigmoid and tanh, preventing them from saturating (which leads to vanishing gradients) early in training; see the sketch after this list.
- This more recent paper, instead, introduces the He initialization strategy, which is designed to work with rectified activations (ReLU, leaky ReLU, etc.), thus reducing the chance of dead units. Indeed, this weight initialization is also employed in the ResNet architectures.
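For reference, here is a minimal NumPy sketch of the two schemes as they are commonly stated (the layer sizes below are arbitrary placeholders): Glorot draws weights with variance $2 / (\text{fan}_{in} + \text{fan}_{out})$, while He uses variance $2 / \text{fan}_{in}$, which compensates for ReLU zeroing out roughly half of its inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    """Glorot/Xavier: Var(W) = 2 / (fan_in + fan_out), uniform variant."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    """He: Var(W) = 2 / fan_in, tuned for ReLU-family activations."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Example layer sizes (arbitrary): 784 inputs -> 256 hidden units.
W_tanh = glorot_uniform(784, 256)   # pair with tanh / sigmoid layers
W_relu = he_normal(784, 256)        # pair with ReLU layers
print(W_tanh.std(), W_relu.std())
```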
That said, this does not necessarily mean that, e.g., using Glorot with ReLU is always bad. But these two papers provide an analysis of how the variance of the activations and gradients propagates through the layers under different initialization strategies, and each scheme is designed to keep that variance under control.
A note about pre-training (also related to transfer learning): strictly speaking, this is not a weight initialization strategy. Indeed, you can pick a popular architecture and download weights pre-trained on ImageNet (for example); that gives you a nice initial point, useful to speed up convergence on downstream tasks, and especially helpful when you have little training data. I consider pre-training to be a second step, because the first time you have to pre-train the model yourself, so starting from a proper weight initialization is beneficial for the pre-training itself.
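For example, here is a minimal Keras sketch of using ImageNet weights as the starting point for a hypothetical downstream classifier (the 10-class head, input shape, and the choice to freeze the backbone are just placeholders for illustration):

```python
import tensorflow as tf

# Backbone pre-trained on ImageNet: its weights serve as the initial point.
backbone = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False,
    pooling="avg", input_shape=(224, 224, 3))
backbone.trainable = False          # optionally freeze it for feature extraction

# Hypothetical downstream task with 10 classes: only the new head starts
# from a fresh random initialization (Glorot uniform by Keras default).
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```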