6

My knowledge

Suppose you have a layer that is fully connected, and that each neuron performs an operation like

a = g(w^T * x + b)

where a is the output of the neuron, x the input, g our generic activation function, and w and b our parameters.
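
For concreteness, here is a minimal NumPy sketch of this forward pass for a whole layer of such neurons (the layer sizes and the choice of ReLU as g are just illustrative assumptions):

```python
import numpy as np

def relu(z):
    # one common choice for the activation g
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

n_in, n_units = 4, 3                   # illustrative sizes
x = rng.normal(size=(n_in,))           # input to the layer
W = rng.normal(size=(n_units, n_in))   # one weight vector w per unit, stacked as rows
b = np.zeros(n_units)                  # one bias per unit

a = relu(W @ x + b)                    # a = g(w^T x + b), computed for every unit at once
print(a)
```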

If both w and b are initialized with all elements equal to each other, then a is equal for each unit of that layer.

This means that we have symmetry: at each iteration of whichever algorithm we choose to update our parameters, they all update in the same way, so there is no point in having multiple units, since they all behave as a single one.

In order to break the symmetry, we could randomly initialize the matrix w and initialize b to zero (this is the setup that I've seen most often). This way a is different for each unit, so that all neurons behave differently.
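
A quick way to see both the symmetry and how random weights break it is to compare the two initializations directly (a toy sketch with made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(0.0, z)

n_in, n_units = 4, 3
x = rng.normal(size=(n_in,))

# Constant initialization: every unit computes exactly the same function.
W_const = np.full((n_units, n_in), 0.5)
b_const = np.full(n_units, 0.1)
print(relu(W_const @ x + b_const))   # all entries are identical

# Random w, zero b: each unit now computes something different.
W_rand = rng.normal(size=(n_units, n_in))
b_zero = np.zeros(n_units)
print(relu(W_rand @ x + b_zero))     # entries differ, symmetry is broken
```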

Of course, randomly initializing both w and b would also be fine, even if not necessary.

Question

Is randomly initializing w the only choice? Could we randomly initialize b instead of w in order to break the symmetry? Does the answer depend on the choice of the activation function and/or the cost function?

My thinking is that we could break the symmetry by randomly initializing b, since in this way a would be different for each unit and, since in backpropagation the derivatives of both w and b depend on a (at least this should be true for all the activation functions that I have seen so far), each unit would behave differently. Obviously, this is only a thought, and I'm not sure that it is true in general.
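
As far as the forward pass goes, this does behave as described: with w set to zero and only b random, the units already produce different outputs. A tiny check (with arbitrary sizes), just to illustrate the setup being asked about:

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(0.0, z)

n_in, n_units = 4, 3
x = rng.normal(size=(n_in,))

W_zero = np.zeros((n_units, n_in))
b_rand = rng.normal(size=(n_units,))

# With w = 0, a = g(b): it differs across units, so the units are no longer symmetric.
print(relu(W_zero @ x + b_rand))
```

Whether this is enough for learning to actually progress is what the answers below address.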

gvgramazio

3 Answers

5

Randomising just b sort of works, but setting w to all zeros causes severe problems with vanishing gradients, especially at the start of learning.

In backpropagation, the gradient at the outputs of a layer L is a sum in which the gradient at the inputs of layer L+1 is multiplied by the weights (and not the biases) between the two layers. That sum is zero if the weights are all zero.

A gradient of zero at L's output will in turn cause all earlier layers (L-1, L-2, etc., all the way back to layer 1) to receive zero gradients, and thus not update either weights or biases at the update step. So the first time you run an update, it will only affect the last layer. The next time, it will affect the two layers closest to the output (but only marginally at the penultimate layer), and so on.
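
A tiny NumPy check of this claim, for a network with one hidden layer, tanh hidden activation, a linear output, and squared-error loss (these choices and the sizes are assumptions made only for the illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

n_in, n_hidden, n_out = 4, 3, 1
x = rng.normal(size=(n_in, 1))
y = np.array([[1.0]])

# The setup under discussion: all weights zero, biases random.
W1 = np.zeros((n_hidden, n_in)); b1 = rng.normal(size=(n_hidden, 1))
W2 = np.zeros((n_out, n_hidden)); b2 = rng.normal(size=(n_out, 1))

# Forward pass
z1 = W1 @ x + b1; a1 = np.tanh(z1)
z2 = W2 @ a1 + b2; y_hat = z2              # linear output, loss = 0.5 * (y_hat - y)^2

# Backward pass
dz2 = y_hat - y                            # gradient at the output layer
dW2, db2 = dz2 @ a1.T, dz2                 # last layer: non-zero gradients
da1 = W2.T @ dz2                           # = 0, because W2 is all zeros
dz1 = da1 * (1 - a1 ** 2)
dW1, db1 = dz1 @ x.T, dz1                  # first layer: all zeros

print("dW2 has non-zero entries:", np.any(dW2 != 0))   # True
print("dW1 has non-zero entries:", np.any(dW1 != 0))   # False: layer 1 does not move yet
```

After one update W2 becomes non-zero, and only then does a gradient start flowing back to the first layer, exactly as described above.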

A related issue is that a layer with weights all zero, or all the same, maps all inputs, no matter how they vary, onto the same output. This can also adversely affect the gradient signal that you are using to drive learning: for a balanced data set, you have a good chance of starting learning close to a local minimum of the cost function.

For deep networks especially, to fight vanishing (or exploding) gradients, you should initialise the weights from a distribution whose expected output magnitude (after multiplying the inputs) and gradient magnitude neither vanish nor explode. Analysing which values work best in deep networks is how Xavier/Glorot initialisation was discovered. Without careful initialisation along these lines, deep networks take much longer to learn, or in the worst cases never recover from a poor start and fail to learn effectively.
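
For reference, a sketch of the uniform variant of Xavier/Glorot initialisation as it is usually stated (treat this as illustrative rather than a definitive recipe):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    # Uniform Xavier/Glorot: sample from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

rng = np.random.default_rng(4)
W = glorot_uniform(fan_in=256, fan_out=128, rng=rng)
b = np.zeros(128)        # biases are typically just set to zero
print(W.std())           # close to sqrt(2 / (fan_in + fan_out))
```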

Potentially, to avoid these problems, you could try to find a good non-zero fixed value for the weights, as an alternative to Xavier initialisation, along with a good magnitude/distribution for bias initialisation. Both would vary with the size of the layer and possibly with the activation function. However, I suspect this could suffer from other issues, such as sampling bias: there are more weights than biases, so you get a better fit to the desired aggregate behaviour by setting all the weight values randomly than by setting only the biases randomly.

Neil Slater
  • What does setting w to zero have to do with vanishing gradients? That is a problem involving saturation of the activation function, not the initialization of the parameters (unless you initialize them to a very high or low value). – gvgramazio Jun 19 '18 at 08:10
  • The gradient descent of a cost function is computed with respect to both the bias and the weights. Thus the actual form of both derivatives depends on which cost function and activation function we chose. In the cases that I have seen, the derivatives of both b and w depend on the value of a. – gvgramazio Jun 19 '18 at 08:17
  • 1
    @gvgramazio: take a look at the backprop formula again. To get `dJ/da` for layer `L` you sum over `dJ/dz * W` from layer `L+1` (bias, the cost function, and activations elsewhere are *not* involved in that step). Intuitively this makes sense - if all weights are zero, then the influence of any activation to the next layer is also zero, so there is no gradient. So the gradients are *blocked* after the first hidden layer with zero weights. There still are non-zero gradients to the weights themselves in that layer . . . so eventually (and slowly) this will fix itself. – Neil Slater Jun 19 '18 at 08:20
  • Uh... sorry about my last comment. I checked my notes just now and did some calculus. Yeah, if the weights of all layers are zero and the biases are not, then at the first update only the last layer will be affected, at the second iteration also the second-to-last, at the third iteration also the third-to-last, and so on. You were right about this. – gvgramazio Jun 19 '18 at 08:22
  • @gvgramazio You won't see it in the last layer, but in the earlier layers the gradient becomes dependent on the weights of the layer after it (the layer processed just before it in backprop), and poof, with zero weights the gradient vanishes. Only after the last layer's weights become non-zero does the second-to-last layer get a non-zero gradient, and so on. –  Jun 19 '18 at 08:22
  • Yeah, I saw this now. – gvgramazio Jun 19 '18 at 08:24
  • But this is related to the fact that all the weights are zero. I mean that b is enough to break the symmetry, but not enough to solve the vanishing gradient problem. If I choose b randomly and the weights all equal to some value (different from zero), I get rid of both problems, right? – gvgramazio Jun 19 '18 at 08:27
  • @gvgramazio Yes, possibly. I have modified my answer to address that. I think there may be other more subtle problems with that fix, although I am not sure. It may just be that Xavier initialisation works well, and there is no theoretical reason to explore an alternative and slightly more complex version of it that adjusts both weights to a fixed value and biases to a distribution. – Neil Slater Jun 19 '18 at 08:30
  • Oh, I know that. The question simply came to my mind because, in the sources I'm following, the topic was dismissed by simply saying that it is necessary to randomly initialize w and not b. – gvgramazio Jun 19 '18 at 08:40
2

Most of the explanations given for choosing something or not choosing something (like hyperparameter tuning) in deep learning are based on empirical studies, such as analysing the error over a number of iterations. So this is the kind of answer that people on the deep learning side give.

Since you have asked for a mathematical explanation, I suggest you read the paper Convergence Analysis of Two-layer Neural Networks with ReLU Activation (2017, NIPS). It talks about the convergence of SGD to global minima, subject to the weight initialisation being Gaussian, with ReLU as the activation function. The paper considers a neural net with no hidden layer, just input and output layers.

The very fact that an analysis of such a 'simple' network gets published at a top, highly reputed conference itself suggests that the explanation you are seeking is not easy, and that very few people work on the theoretical aspects of neural nets. IMHO, after some years, as the research progresses, I might be able to edit this answer and give the explanation you sought. Till then, this is the best I could do.

varsh
  • You surely understood what I'm looking for. The [answer by Joe S](https://ai.stackexchange.com/a/6792/16199) is similar to the ones already given in previous similar posts that I linked at the beginning of my post. Look at the [answer by Neil Slater](https://ai.stackexchange.com/a/6794/16199); I think it completely clears up all my doubts. – gvgramazio Jun 19 '18 at 08:46
  • @gvgramazio Yes, but what about the theoretical explanation that you are looking for? He (Neil Slater) has provided a good intuitive and qualitative explanation. – varsh Jun 19 '18 at 10:23
  • He gave me a hint in the comments below his answer and then I found it by myself. In back-propagation, compute the formulas for the gradient with respect to w and b for layer i. If you do all the substitutions, you'll find that the gradient of layer i is directly proportional to the weights of the following layer, while this is not true for b. This means that w=0 implies the gradient is stuck at 0. Note that this has nothing to do with symmetry, but is due to the fact that w=0. If we choose w=1, we don't have this problem (but we could have a symmetry problem if b=const). – gvgramazio Jun 19 '18 at 12:02
  • Suppose that for each neuron we have `z = w^T x + b` and `a = g(z)`. I prepend `d` to a variable to indicate the derivative of the cost function with respect to that variable. For a generic layer i, we have `dw_i = dz_i x^T` and `db_i = dz_i`, but `dz_i = (w_{i+1}^T dz_{i+1}) ∘ g'_i(z_i)`, where `∘` is the Hadamard product, `g'` is the derivative of your activation function, and the transpositions may differ depending on your implementation (see the worked equations after these comments). The important thing is that if `w_{i+1} = 0`, then `dw_i, db_i = 0`. – gvgramazio Jun 19 '18 at 12:13
  • This means that, with all weights equal to zero, only the last layer will have a gradient different from zero at the first iteration; at the second iteration, only the last two, and so on... – gvgramazio Jun 19 '18 at 12:15
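
Written out with a generic layer index, the chain of substitutions sketched in the comments above is (same notation, with x_i denoting the input to layer i):

$$
dz_i = \left(w_{i+1}^{T}\, dz_{i+1}\right) \circ g'_i(z_i), \qquad dw_i = dz_i\, x_i^{T}, \qquad db_i = dz_i .
$$

So if $w_{i+1} = 0$, then $dz_i = 0$ and hence $dw_i = db_i = 0$: the gradient is blocked at layer $i$ regardless of how $b$ was initialised.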
0

w should be randomized to small (nonzero) numbers so that the adjustments made by backpropagation are more meaningful and each value in the matrix is updated by a different amount. If you start with all zeros, it will still work, but it will take longer to get to a meaningful result. AFAIK, this was found empirically by various researchers and became common practice.
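
As a sketch of the "small random numbers" heuristic described here (the 0.01 scale factor and the sizes are just typical illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
fan_in, fan_out = 64, 32

W = 0.01 * rng.normal(size=(fan_out, fan_in))  # small random weights break the symmetry
b = np.zeros(fan_out)                          # the bias is usually just left at zero
```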

Randomizing b does not help in the same way, so most people do not bother.

This choice is one of many made by the architect of the network, and theoretically you could use an infinite number of w matrix initializations. The one commonly used just happens to have been well tested and to work in general.

This video explains it better than I can: Lecture 8.4 — Neural Networks Representation | Model Representation-II — [Andrew Ng].

Joe S
  • Your answer isn't a real answer, because you simply stated that randomizing b doesn't have the same effect without explaining the reason. I know that initializing b instead of w is somewhat unhelpful, since nobody does that, but I wished to know __why__. – gvgramazio Jun 19 '18 at 07:53
  • Fun fact: I was reading something from the professor in the link you provided when this question came to my mind. :) – gvgramazio Jun 19 '18 at 08:47
  • If you start with all zeros it actually won't work: you need at least some non-zero weights in every layer or else nothing will happen during training. – Jeremy List Sep 19 '19 at 22:01