I've been trying to understand where the formulas for Xavier and Kaiming He initialization come from. My understanding is that these initialization schemes come from a desire to keep the gradients stable during back-propagation (avoiding vanishing/exploding gradients).
I think I can understand the justification for Xavier initialization, and I'll sketch it below. For He initialization, what the original paper actually shows is that the initialization scheme keeps the pre-activation values (the weighted sums) stable throughout the network. Most sources I've found explaining Kaiming He initialization seem to take it as "obvious" that stable pre-activation values will somehow lead to stable gradients, and don't even mention the apparent mismatch between what the math shows and what we're actually trying to accomplish.
The justification for Xavier initialization (introduced here) is as follows, as I understand it:
1. As an approximation, pretend the activation functions don't exist and we have a linear network. The actual paper says we're assuming the network starts out in the "linear regime", which for the sigmoid activations they're interested in would mean we're assuming the pre-activations at every layer will be close to zero. I don't see how this could be justified, so I prefer to just say we're disregarding the activation functions entirely, but in any case that's not what I'm confused about here.
2. Zoom in on one edge in the network. It looks like $x \xrightarrow{w} y$, connecting the input or activation value $x$ to the activation value $y$ with the weight $w$. When we do gradient descent we consider $\frac{\partial C}{\partial w}$, and we have: $$\frac{\partial C}{\partial w}=x\frac{\partial C}{\partial y}$$ So if we want to avoid unstable $\frac{\partial C}{\partial w}$-s, a sufficient (not necessary, but that's fine) condition is to keep both of those factors stable - the activations and the gradients with respect to activations. So we try to do that. (A tiny numerical check of this factorization is sketched after this list.)
3. To measure the "size" of an activation, let's look at its mean and variance (where the randomness comes from the random weights). If the weights on each layer are zero-mean and i.i.d., then we can show that all of the activation values in our network are zero-mean too. So controlling the size comes down to controlling the variance (a big variance means the activation tends to have a large absolute value, and vice versa). Since the gradients with respect to activations are calculated by essentially running the network backwards, we can show that they're all zero-mean too, so controlling their size also comes down to controlling their variance.
4. We can show that all the activations on a given layer are identically distributed, and ditto for the gradients with respect to activations on a given layer. If $v_n$ is the variance of the activations on layer $n$, and $v'_n$ is the variance of the gradients with respect to those activations, we have $$v_{n+1}=v_n k_n \sigma^2$$ $$v'_n=v'_{n+1} k_{n+1} \sigma^2$$ where $k_i$ is the number of neurons on the $i$-th layer, and $\sigma^2$ is the variance of the weights between the $n$-th and $(n+1)$-th layers. So to keep either of the growth factors from being too crazy, we would want $\sigma^2$ to be equal to both $1/k_n$ and $1/k_{n+1}$. We can't have both in general, so the paper compromises by setting $\sigma^2 = \frac{2}{k_n + k_{n+1}}$, the harmonic mean of the two. (A quick numerical check of these recursions is sketched after this list.)
5. This stops the activations from exploding out of control, and stops the gradients with respect to activations from exploding out of control, which by step (2) stops the gradients with respect to the weights (which at the end of the day are the only things we really care about) from growing out of control.
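To convince myself of the factorization in step (2), here is a tiny finite-difference check. The one-edge "network" and the squared-error cost are made up by me purely for illustration; nothing here comes from either paper.

```python
# Finite-difference check of dC/dw = x * dC/dy from step (2).
# The single edge x ->(w) y and the cost C are my own toy example.
import numpy as np

x, w, target = 0.7, -1.3, 2.0           # input value, weight, arbitrary target

def cost(w_):
    y = w_ * x                           # the edge x ->(w) y, no activation
    return 0.5 * (y - target) ** 2       # some cost C downstream of y

y = w * x
dC_dy = y - target                       # analytic dC/dy for this particular cost
analytic = x * dC_dy                     # the claimed dC/dw = x * dC/dy

eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)
print(analytic, numeric)                 # the two values agree to ~1e-9
```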
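And here is a minimal numpy sketch of steps (3)-(4): a deep bias-free linear network with i.i.d. zero-mean weights, checking empirically that the activations and the backward-propagated gradients stay (roughly) zero-mean and that the layer variances follow the recursions above. The layer widths, the sample count, and the use of an independent random vector as a stand-in for the gradient at the top layer are all my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [200, 300, 250, 400, 350, 300]   # k_0, ..., k_5: arbitrary layer sizes
n_samples = 5000

def xavier_sigma2(fan_in, fan_out):
    # Glorot & Bengio's compromise: the harmonic mean of 1/fan_in and 1/fan_out
    return 2.0 / (fan_in + fan_out)

# One weight matrix per pair of adjacent layers, entries i.i.d. N(0, sigma^2).
weights = [rng.normal(0.0, np.sqrt(xavier_sigma2(k_in, k_out)), size=(k_out, k_in))
           for k_in, k_out in zip(widths[:-1], widths[1:])]

# Forward pass with no activation functions (the "linear network" of step (1)).
acts = [rng.normal(0.0, 1.0, size=(widths[0], n_samples))]
for W in weights:
    acts.append(W @ acts[-1])
for n, a in enumerate(acts):
    print(f"layer {n}: mean = {a.mean():+.3f}, var = {a.var():.3f}")
# Each ratio v_{n+1}/v_n should be close to k_n * sigma^2 = 2*k_n/(k_n + k_{n+1}),
# which stays near 1, so the activation variances remain of order 1.

# Backward pass: gradients w.r.t. activations propagate through W^T, so the
# analogous recursion v'_n = v'_{n+1} * k_{n+1} * sigma^2 should hold.
grads = [rng.normal(0.0, 1.0, size=(widths[-1], n_samples))]  # stand-in for dC/d(top layer)
for W in reversed(weights):
    grads.append(W.T @ grads[-1])
for n, g in zip(range(len(widths) - 1, -1, -1), grads):
    print(f"grad at layer {n}: mean = {g.mean():+.3f}, var = {g.var():.3f}")
```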
However, when I look at the paper on He initialization, it seems like almost every step in this logic breaks down. First of all, the math, if I understand correctly, shows that He initialization controls the pre-activations, not the activations. Therefore the argument from step (2), that controlling the activations tells us something about the gradients with respect to the weights, no longer applies. Second, the activation values in a ReLU network like the one the authors consider are not zero-mean (as they point out themselves), which means that even the reasoning from step (3) about why we should care about the variances fails. The variance is only relevant for Xavier initialization because in that setting the mean is always zero, so the variance is a reasonable proxy for "bigness".
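For what it's worth, here is a hedged numerical illustration of the part the paper does establish: in a deep ReLU network, He's choice $\sigma^2 = 2/k$ keeps the pre-activation variances stable, while Xavier's choice makes them decay. The width, depth, and layer structure below are made up by me, and this only demonstrates the variance claim itself, not the link to gradients that I'm asking about.

```python
# Track the variance of the pre-activations in a deep ReLU network under
# Xavier vs. He initialization. This illustrates only what the He et al.
# paper proves: stability of the pre-activation variances.
import numpy as np

rng = np.random.default_rng(1)
width, depth, n_samples = 256, 30, 2000   # arbitrary sizes chosen for illustration

def preactivation_vars(sigma2):
    """Variance of the pre-activations at each layer for i.i.d. N(0, sigma2) weights."""
    h = rng.normal(0.0, 1.0, size=(width, n_samples))
    variances = []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(sigma2), size=(width, width))
        z = W @ h                     # pre-activations (the weighted sums)
        variances.append(z.var())
        h = np.maximum(z, 0.0)        # ReLU
    return variances

xavier = preactivation_vars(2.0 / (width + width))   # = 1/width for equal fan-in/out
he     = preactivation_vars(2.0 / width)             # He et al.'s choice for ReLU

for layer in (0, 9, 19, 29):
    print(f"layer {layer + 1:2d}: Xavier var = {xavier[layer]:.3e}   He var = {he[layer]:.3e}")
# With Xavier the pre-activation variance decays roughly like (1/2)^depth in this
# ReLU network, while with He it stays of order 1 -- exactly what the paper's
# variance analysis predicts.
```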
So while I can see how the authors show that He initialization controls the variances of the pre-activations in a ReLU network, for me the entire reason why we should care about doing this has fallen apart.