For questions about the vanishing gradient problem, a numerical issue that arises when training a (deep) neural network with gradient-based optimization. There is also the related exploding gradient problem.
Questions tagged [vanishing-gradient-problem]
20 questions
6
votes
1 answer
If vanishing gradients are NOT the problem that ResNets solve, then what is the explanation behind ResNet success?
I often see blog posts or questions on here starting with the premise that ResNets solve the vanishing gradient problem.
The original 2015 paper contains the following passage in section 4.1:
We argue that this optimization difficulty is unlikely…

Alexander Soare
- 1,319
- 2
- 11
- 26
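For context on why the vanishing-gradient claim gets attached to ResNets in the first place, here is the usual skip-connection identity; this is a sketch of the folk argument the question is pushing back against, not the paper's own reasoning in section 4.1:

```latex
% Residual block: the output is the input plus a learned residual.
\[
  y = x + F(x), \qquad
  \frac{\partial \mathcal{L}}{\partial x}
    = \frac{\partial \mathcal{L}}{\partial y}\!\left(I + \frac{\partial F}{\partial x}\right)
    = \frac{\partial \mathcal{L}}{\partial y}
      + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}.
\]
% The identity term gives the gradient a direct path around F, which is the
% basis of the common "ResNets fix vanishing gradients" explanation.
```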
6
votes
1 answer
Why do ResNets avoid the vanishing gradient problem?
I read that, if we use the sigmoid or hyperbolic tangent activation functions in deep neural networks, we can run into the vanishing gradient problem, and this is visible from the shapes of these functions' derivatives. ReLU solves…

FraMan
- 189
- 2
- 10
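A minimal numpy sketch of the derivative-magnitude argument the excerpt refers to (my own illustration, not taken from the question; layer widths and weight scaling are ignored):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The sigmoid derivative never exceeds 0.25, so a chain of sigmoid layers
# multiplies the backpropagated signal by at most 0.25 per layer
# (ignoring the weights), shrinking it geometrically with depth.
depth = 30
best_case = dsigmoid(np.zeros(depth)).prod()   # every unit at x = 0, derivative = 0.25
print(f"product of {depth} sigmoid derivatives (best case): {best_case:.2e}")
# ~ 0.25**30, roughly 8.7e-19

# ReLU's derivative is exactly 1 on its active region, so the same product
# stays at 1 as long as the units remain active.
print(f"product of {depth} ReLU derivatives (active units): {np.ones(depth).prod():.0f}")
```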
5
votes
2 answers
What are the common pitfalls that we could face when training neural networks?
Apart from the vanishing or exploding gradient problems, what other problems or pitfalls could we face when training neural networks?

pjoter
- 51
- 1
5
votes
1 answer
What effect does batch norm have on the gradient?
Batch norm is a technique that essentially standardizes the activations at each layer before passing them on to the next layer. Naturally, this will affect the gradient through the network. I have seen the equations that derive the…

information_interchange
- 319
- 1
- 9
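One way to see the effect empirically is to compare the gradient reaching the first layer of a deep sigmoid MLP with and without batch norm. This PyTorch sketch is a hypothetical setup of mine (the depth, width, and squared-output loss are arbitrary choices), not the derivation the question refers to:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def deep_mlp(depth: int, width: int, use_bn: bool) -> nn.Sequential:
    """Deep sigmoid MLP, optionally with batch norm before each activation."""
    layers = []
    for _ in range(depth):
        layers.append(nn.Linear(width, width))
        if use_bn:
            layers.append(nn.BatchNorm1d(width))
        layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)

def first_layer_grad_norm(model: nn.Sequential) -> float:
    x = torch.randn(64, 32)            # one random mini-batch
    model(x).pow(2).mean().backward()  # arbitrary scalar loss
    return model[0].weight.grad.norm().item()

plain  = deep_mlp(depth=20, width=32, use_bn=False)
normed = deep_mlp(depth=20, width=32, use_bn=True)

print("first-layer grad norm, no batch norm  :", first_layer_grad_norm(plain))
print("first-layer grad norm, with batch norm:", first_layer_grad_norm(normed))
```

With the normalization layers in place, the first-layer gradient is typically orders of magnitude larger, because each layer's pre-activations are kept away from the saturated tails of the sigmoid.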
5
votes
1 answer
How to detect vanishing gradients?
Can vanishing gradients be detected by the change in distribution (or lack thereof) of my convolution's kernel weights throughout the training epochs? And if so, how?
For example, if only 25% of my kernel's weights ever change throughout the epochs,…

Elegant Code
- 153
- 1
- 7
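A lack of movement in the kernel weights is an indirect symptom; logging per-layer gradient norms alongside the relative weight change is a more direct check. A generic PyTorch monitoring sketch (the helper names are mine, not from any answer):

```python
import torch

def grad_norms(model: torch.nn.Module) -> dict:
    """Per-parameter gradient L2 norms; call right after loss.backward()."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters()
            if p.grad is not None}

def relative_weight_change(model: torch.nn.Module, snapshot: dict) -> dict:
    """How far each parameter has moved relative to a saved snapshot."""
    return {name: ((p.detach() - snapshot[name]).norm()
                   / (snapshot[name].norm() + 1e-12)).item()
            for name, p in model.named_parameters()}

# usage sketch inside a training loop:
#   snapshot = {n: p.detach().clone() for n, p in model.named_parameters()}
#   ... loss.backward() ...
#   print(grad_norms(model))                         # norms collapsing toward 0?
#   print(relative_weight_change(model, snapshot))   # layers that barely move?
```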
4
votes
0 answers
Why does sigmoid saturation prevent signal flow through the neuron?
As per these slides on page 35:
Sigmoids saturate and kill gradients.
when the neuron's activation saturates at either tail of 0 or 1, the gradient at these regions is almost zero.
…it will effectively "kill" the gradient and almost no signal will flow through the neuron…

EEAH
- 193
- 1
- 5
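The numbers behind the quoted claim are easy to reproduce; a small numpy check of the sigmoid's local derivative at increasingly saturated inputs (my own illustration, not from the slides):

```python
import numpy as np

def dsigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# During backprop, the upstream gradient is multiplied by this local
# derivative; for a saturated sigmoid it is essentially zero.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"sigma'({x:>4}) = {dsigmoid(x):.2e}")
# sigma'( 0.0) = 2.50e-01
# sigma'( 2.0) = 1.05e-01
# sigma'( 5.0) = 6.65e-03
# sigma'(10.0) = 4.54e-05
```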
3
votes
1 answer
Why aren't artificial derivatives used more often to solve the vanishing gradient problem?
While looking into the vanishing gradient problem, I came across a paper (https://ieeexplore.ieee.org/abstract/document/9336631) that used artificial derivatives in lieu of the real derivatives. For a visualization, see the attached image:
As you…

postnubilaphoebus
- 345
- 1
- 11
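The linked paper's construction is not reproduced here, but the general idea of swapping in an artificial derivative can be sketched with a custom backward pass. Everything below (the class name, the clamp-at-0.1 surrogate) is a hypothetical illustration of the technique, not the paper's method:

```python
import torch

class SurrogateSigmoid(torch.autograd.Function):
    """Sigmoid forward pass, but the backward pass uses an artificial
    derivative bounded away from zero (a generic illustration of the idea,
    not the construction from the linked paper)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        s = torch.sigmoid(x)
        true_grad = s * (1 - s)
        # Hypothetical surrogate: never let the local derivative drop below 0.1.
        return grad_output * torch.clamp(true_grad, min=0.1)

x = (torch.randn(4) * 10).requires_grad_()   # strongly saturated inputs
SurrogateSigmoid.apply(x).sum().backward()
print(x.grad)   # stays >= 0.1 elementwise instead of collapsing toward 0
```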
3
votes
0 answers
Would a different learning rate for every neuron and layer mitigate or solve the vanishing gradient problem?
I'm interested in using the sigmoid (or tanh) activation function instead of ReLU. I'm aware of ReLU's advantages: faster computation and no vanishing gradient problem. But regarding the vanishing gradient, the main problem is about the backpropagation…

Rogelio Triviño
- 141
- 3
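A simpler relative of the idea, per-layer (rather than per-neuron) learning rates, is easy to try with optimizer parameter groups. The 10x/3x/1x factors below are arbitrary illustrative choices, not tuned values:

```python
import torch
import torch.nn as nn

# Layer-wise learning rates via parameter groups: give layers that receive
# smaller gradients a larger step.
model = nn.Sequential(
    nn.Linear(32, 32), nn.Sigmoid(),
    nn.Linear(32, 32), nn.Sigmoid(),
    nn.Linear(32, 1),
)

base_lr = 1e-3
optimizer = torch.optim.SGD([
    {"params": model[0].parameters(), "lr": base_lr * 10},  # earliest layer
    {"params": model[2].parameters(), "lr": base_lr * 3},
    {"params": model[4].parameters(), "lr": base_lr},       # output layer
], lr=base_lr)
```

Per-parameter adaptive optimizers such as RMSprop and Adam effectively perform this rescaling automatically, though rescaling the step cannot recover a gradient that has already underflowed to numerical zero.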
2
votes
1 answer
How does the vanishing gradient prevent RNNs from learning long-range dependencies?
I am really trying to understand deep learning models like RNNs, LSTMs, etc. I have gone through many RNN tutorials and have learned that RNNs cannot handle long-range dependencies, like:
Consider trying to predict the last word in the text “I…

Nafees Ahmed
- 41
- 3
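The usual formalization of that intuition (as in the Pascanu et al. 2013 analysis referenced elsewhere on this page) bounds the Jacobian of a hidden state with respect to a distant earlier one:

```latex
% Backpropagation through time for a vanilla RNN with
% h_t = \tanh(W h_{t-1} + U x_t), pre-activation a_t:
\[
  \frac{\partial h_T}{\partial h_t}
    = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}
    = \prod_{k=t+1}^{T} \operatorname{diag}\!\big(\tanh'(a_k)\big)\, W,
  \qquad
  \left\lVert \frac{\partial h_T}{\partial h_t} \right\rVert
    \le \big(\gamma\,\lVert W \rVert\big)^{\,T-t},
\]
% where \gamma bounds |\tanh'| (here \gamma = 1). If \gamma \lVert W \rVert < 1
% this bound decays exponentially in the gap T - t, so the loss at step T
% contributes almost nothing to the updates driven by distant inputs.
```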
2
votes
0 answers
How to decide if gradients are vanishing?
I am trying to debug a convolutional neural network. I am seeing gradients close to zero.
How can I decide whether these gradients are vanishing or not? Is there some threshold for judging vanishing gradients by looking at the values?
I am getting…

pramesh
- 121
- 4
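There is no universal absolute threshold; a scale-free heuristic is to track how the earliest layer's gradient norm compares to the last layer's over training. A hypothetical PyTorch helper (it assumes model.parameters() iterates from input to output, as it does for nn.Sequential-style models):

```python
import torch

def grad_norm_ratio(model: torch.nn.Module) -> float:
    """Ratio of the earliest trainable parameter's gradient norm to the last
    one's; call right after loss.backward(). A ratio that keeps shrinking over
    training (e.g. well below 1e-3 and still falling) is a stronger hint of
    vanishing gradients than any fixed threshold on the raw values."""
    norms = [p.grad.norm().item()
             for p in model.parameters() if p.grad is not None]
    return norms[0] / (norms[-1] + 1e-12)
```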
2
votes
2 answers
How do LSTM and GRU overcome the vanishing gradient problem?
I'm watching the video Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorflow Tutorial | Edureka, where the author says that the LSTM and GRU architectures help to reduce the vanishing gradient problem. How do LSTM and GRU…

DRV
- 1,573
- 2
- 11
- 18
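The short version of the standard answer, in equations (a sketch, not a full derivation): the LSTM cell state is updated additively, and gradient flowing along it is gated elementwise rather than squashed through a weight matrix at every step.

```latex
% LSTM cell-state update and the direct gradient path through it:
\[
  c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,
  \qquad
  \left.\frac{\partial c_t}{\partial c_{t-1}}\right|_{\text{direct path}}
    = \operatorname{diag}(f_t),
\]
% (plus smaller terms through the gates, which also depend on h_{t-1}).
% When the forget gate f_t is close to 1 this path is nearly the identity,
% so the error signal can survive across many time steps. GRUs obtain a
% similar additive path via h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t.
```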
1
vote
1 answer
Mathematically speaking, is it only the product operation used in the chain rule that causes the vanishing or exploding gradient?
I am asking this question from the mathematical perspective of the vanishing and exploding gradient problems that we generally face when training deep neural networks.
The chain rule of differentiation for a composite function can be expressed…

hanugm
- 3,571
- 3
- 18
- 50
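For reference, the product the question is asking about, together with the norm bound that makes both regimes visible (standard material, not taken from the answers):

```latex
% For a depth-L composition h_L = f_L(f_{L-1}(\dots f_1(x)\dots)),
% with layer Jacobians J_l = \partial h_l / \partial h_{l-1}:
\[
  \frac{\partial \mathcal{L}}{\partial h_1}
    = \frac{\partial \mathcal{L}}{\partial h_L}\,
      \prod_{l=2}^{L} J_l,
  \qquad
  \left\lVert \prod_{l=2}^{L} J_l \right\rVert
    \le \prod_{l=2}^{L} \lVert J_l \rVert .
\]
% If every \lVert J_l \rVert \le c < 1 the product can shrink like c^{L-1};
% if the norms exceed 1 it can grow the same way. The repeated product drives
% both phenomena, while the per-layer factors (weights and activation
% derivatives) set the rate.
```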
1
vote
0 answers
How do I infer exploding or vanishing gradients in Keras?
It may already be obvious that I am just a practitioner and a beginner in deep learning. I am still figuring out lots of the "why"s and "how"s of DL.
So, for example, if I train a feed-forward neural network, or an image classifier with CNNs, or…

Naveen Reddy Marthala
- 205
- 2
- 10
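Keras does not log gradients by default, but a tf.GradientTape pass over a batch gives per-variable norms that can be inspected periodically. A generic diagnostic sketch (the function name and the example loss are my own choices):

```python
import tensorflow as tf

def layer_gradient_norms(model: tf.keras.Model, x, y, loss_fn) -> dict:
    """Per-variable gradient norms for one batch; run every few steps
    alongside (or instead of) model.fit as a diagnostic."""
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    return {var.name: float(tf.norm(grad))
            for var, grad in zip(model.trainable_variables, grads)
            if grad is not None}

# usage sketch (loss function chosen as an example):
#   norms = layer_gradient_norms(model, x_batch, y_batch,
#                                tf.keras.losses.SparseCategoricalCrossentropy())
# Watch the earliest layers: norms steadily shrinking toward 0 point to
# vanishing gradients; norms growing by orders of magnitude (or a NaN loss)
# point to exploding gradients.
```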
1
vote
1 answer
Does the paper "On the difficulty of training Recurrent Neural Networks" (2013) assume, falsely, that spectral radii are $\ge$ square matrix norms?
(link to paper in arxiv)
In section 2.1 the authors define $\gamma$ as the maximum possible value of the derivative of the activation function (e.g., 1 for tanh). Then they have this to say:
We first prove that it is sufficient for $\lambda_1 <…

Jeremiah England
- 161
- 4
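For reference, the standard inequality the title is implicitly checking the paper against (stated here without judging the paper's argument):

```latex
% For any induced (or, more generally, sub-multiplicative) matrix norm:
\[
  \rho(W) \;=\; \max_i |\lambda_i(W)| \;\le\; \lVert W \rVert,
\]
% i.e. the spectral radius is a lower bound on the norm, not an upper bound.
% Equality holds for the 2-norm when W is normal (e.g. symmetric).
```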
1
vote
0 answers
Which activation functions can lead to the vanishing gradient problem?
According to this video tutorial, Vanishing Gradient Tutorial, the sigmoid function and the hyperbolic tangent can produce the vanishing gradient problem.
What other activation functions can lead to the vanishing gradient problem?

DRV
- 1,573
- 2
- 11
- 18