In this lecture, the professor says that one problem with the sigmoid function is that its outputs aren't zero-centered. The explanation she gives for why this is bad is that the gradient of the loss with respect to the weights, $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \sigma}\frac{\partial \sigma}{\partial w}$, will always be either all negative or all positive, which causes a problem when updating the weights: as she shows in this slide, we can't move in the direction of a vector like $(1,-1)$. I don't understand why, since she only talks about one component of the gradient and not the whole vector. If the components of the gradient of the loss can have different signs, that should allow us to move in different directions; am I wrong? The other thing I don't understand is how this property generalizes to non-zero-centered activation functions and non-zero-centered data.
1 Answer
Yes, if the activation function of the network is not zero-centered, $y = f(x^{T}w)$ is always positive or always negative. Thus, the output of a layer is always pushed towards either positive or negative values. As a result, the weight vector needs more updates to be trained properly, and the number of epochs needed for the network to train also increases. This is why the zero-centered property is important, though it is NOT strictly necessary.
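To make the "always positive or always negative" point concrete, here is a minimal NumPy sketch for a single sigmoid neuron (the squared-error loss, shapes, and variable names are my own choices, not from the lecture or the survey). Because every input $x_i$ is positive, every component of $\partial L / \partial w$ carries the sign of the same scalar factor, so an update can never point along a direction like $(1,-1)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = sigmoid(rng.normal(size=5))      # all-positive: pretend these are sigmoid outputs of a previous layer
w = rng.normal(size=5)
target = 0.2                         # arbitrary regression target for this toy example

y = sigmoid(x @ w)                   # scalar output of the neuron
dL_dy = 2.0 * (y - target)           # squared-error loss L = (y - target)^2
grad_w = dL_dy * y * (1.0 - y) * x   # chain rule: dL/dw_i = dL/dy * sigma'(x.w) * x_i

print(grad_w)
print("all components share one sign:", bool(np.all(grad_w > 0) or np.all(grad_w < 0)))
```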
Zero-centered activation functions ensure that the mean activation value is around zero. This property is important in deep learning because it has been empirically shown that models operating on normalized data (whether inputs or latent activations) enjoy faster convergence.
Unfortunately, zero-centered activation functions like tanh saturate at their asymptotes: in those regions the gradients become vanishingly small, leading to a weak training signal.
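As a quick numeric check of that saturation claim (my own sketch, using the standard derivatives $\tanh'(z) = 1 - \tanh^{2}(z)$ and $\sigma'(z) = \sigma(z)(1 - \sigma(z))$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Both derivatives shrink towards zero as |z| grows, i.e. in the saturated regions.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z={z:5.1f}  tanh'={1 - np.tanh(z)**2:.2e}  sigmoid'={sigmoid(z) * (1 - sigmoid(z)):.2e}")
```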
ReLU avoids this saturation problem (at least for positive inputs), but it is not zero-centered. Therefore, all-positive or all-negative activation functions, whether sigmoid or ReLU, can be difficult for gradient-based optimization. To solve this problem, deep learning practitioners have invented a myriad of normalization layers (batch norm, layer norm, weight norm, etc.): the data is normalized to be zero-centered, either in advance for the inputs or inside the network, as in batch/layer normalization.
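For illustration, here is a minimal sketch of the core step such normalization layers share: subtract the batch mean and divide by the batch standard deviation so the values fed to the next layer are roughly zero-centered. This is my own simplified example and omits the learnable scale and shift that batch/layer norm also include:

```python
import numpy as np

def normalize(h, eps=1e-5):
    # Per-feature mean and variance over the batch dimension.
    mean = h.mean(axis=0, keepdims=True)
    var = h.var(axis=0, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
h = 1.0 / (1.0 + np.exp(-rng.normal(size=(8, 4))))  # all-positive sigmoid activations
print(h.mean(axis=0))             # every feature mean is strictly positive
print(normalize(h).mean(axis=0))  # approximately zero after normalization
```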
Reference:
A Survey on Activation Functions and their relation with Xavier and He Normal Initialization
- Thank you for your answer. Can you provide more details please? I don't understand how, if the function $y=f(x^{T}w)$ isn't *zero-centered*, the outputs will always be either positive or negative. If we imagine that $f$ has a symmetrical graph (like a Gaussian function) shifted towards the positives, for example, then it all depends on the "width", am I wrong? A more concrete example: $y=\tanh(z-2)$. Then if $x^{T}w=[-4,2,4]$ we'll have $y \approx [-1,0,1]$. Do we always expect that $x^{T}w$ will be very close to zero, so that we aren't in the latter case? – Daviiid Mar 28 '21 at 15:37
- For the sigmoid function, the gradient saturates as $\sigma(x) \to 1$ or $\sigma(x) \to 0$. We can tell this simply from the slope of the function as these points are approached, and also by computing the derivative. We know that $\sigma'(x)=\sigma(x)(1-\sigma(x))$, so $\sigma'(x) \to 0$ at these extreme values. – Faizy Mar 28 '21 at 16:50
- Therefore, if the inputs are `not zero-centered` and we use a sigmoid, saturation is more likely than if the inputs were zero-centered. But early on in learning we generally want **larger gradients**, since our error is high, so this saturation can drastically slow down learning. Since `zero-centering` the data keeps the inputs of the sigmoid away from where it can **saturate**, `zero-centering` can increase the gradient signal during the early stages of learning, providing for faster learning. – Faizy Mar 28 '21 at 16:58
- You know that the sigmoid is bounded to the range $(0,1)$. Hence it always produces a positive value as output. – Faizy Mar 28 '21 at 17:21
- This may be somewhat out of context, but let me ask it: "For negative inputs, the ReLU activation gives zero as output. Is it a problem that, for negative values, the weights of that layer will not update, just like the vanishing gradient of the tanh activation?" – Swakshar Deb Mar 28 '21 at 18:25
- @Faizy thank you for your answer. I think the term zero-centered may be a little confusing, because we may have data that is zero-centered but has a large magnitude; imagine the set of points $\{-10, -5, -3, 3, 5, 10\}$. We could say this set is zero-centered, and it would still give us saturation. But I understand what you mean. Thank you for all the explanation. – Daviiid Mar 28 '21 at 18:42
- @SwaksharDeb I don't think the problem comes from the weights alone, since you have $\mathrm{ReLU}(x^{T}w)$. As long as $x^{T}w$ is negative, that neuron won't be updated, since the gradient will be zero. I believe this happens for two reasons: *1)* when you initialize the weights, you're unlucky enough to have $x^{T}w$ negative and you get no update; *2)* you have a good initialization but you update the weights in such a way that $x^{T}w$ becomes negative and your neuron just dies. But since we're only able to work on our weights, I guess you're right. – Daviiid Mar 28 '21 at 18:47
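To make the dying-ReLU scenario from the last two comments concrete, here is a small NumPy sketch (my own example, not from the thread): with an unlucky initialization, the pre-activation $x^{T}w$ is negative for every input, so the ReLU output and its gradient are zero and the neuron's weights never receive an update.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.abs(rng.normal(size=(16, 4)))   # all-positive inputs
w = -np.abs(rng.normal(size=4))        # unlucky initialization: x @ w < 0 for every example

z = x @ w                              # pre-activations, all negative here
a = np.maximum(z, 0.0)                 # ReLU activations, all zero

upstream = rng.normal(size=16)         # whatever gradient flows back from the loss
grad_w = x.T @ (upstream * (z > 0))    # ReLU subgradient is 1 where z > 0, else 0

print("activations:", a[:4])           # zeros
print("gradient wrt w:", grad_w)       # exactly zero, so this neuron never gets an update
```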