For questions about the vanishing gradient problem, a numerical issue that arises when training a (deep) neural network with gradient-based optimization. There is also the related exploding gradient problem.
Questions tagged [vanishing-gradient-problem]
20 questions
6
votes
1 answer
If vanishing gradients are NOT the problem that ResNets solve, then what is the explanation behind ResNet success?
I often see blog posts or questions on here starting with the premise that ResNets solve the vanishing gradient problem.
The original 2015 paper contains the following passage in section 4.1:
We argue that this optimization difficulty is unlikely…

Alexander Soare
- 1,319
- 2
- 11
- 26
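For context on why the vanishing-gradient claim gets attached to ResNets in the first place, here is the usual skip-connection identity; this is a sketch of the folk argument the question is pushing back against, not the paper's own reasoning in section 4.1:

```latex
% Residual block: the output is the input plus a learned residual.
\[
  y = x + F(x), \qquad
  \frac{\partial \mathcal{L}}{\partial x}
    = \frac{\partial \mathcal{L}}{\partial y}\!\left(I + \frac{\partial F}{\partial x}\right)
    = \frac{\partial \mathcal{L}}{\partial y}
      + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}.
\]
% The identity term gives the gradient a direct path around F, which is the
% basis of the common "ResNets fix vanishing gradients" explanation.
```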
6
votes
1 answer
Why do ResNets avoid the vanishing gradient problem?
I read that, if we use the sigmoid or hyperbolic tangent activation functions in deep neural networks, we can run into the vanishing gradient problem, and this is visible from the shapes of these functions' derivatives. ReLU solves…

FraMan
- 189
- 2
- 10
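A minimal numpy sketch of the derivative-magnitude argument the excerpt refers to (my own illustration, not taken from the question; layer widths and weight scaling are ignored):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The sigmoid derivative never exceeds 0.25, so a chain of sigmoid layers
# multiplies the backpropagated signal by at most 0.25 per layer
# (ignoring the weights), shrinking it geometrically with depth.
depth = 30
best_case = dsigmoid(np.zeros(depth)).prod()   # every unit at x = 0, derivative = 0.25
print(f"product of {depth} sigmoid derivatives (best case): {best_case:.2e}")
# ~ 0.25**30, roughly 8.7e-19

# ReLU's derivative is exactly 1 on its active region, so the same product
# stays at 1 as long as the units remain active.
print(f"product of {depth} ReLU derivatives (active units): {np.ones(depth).prod():.0f}")
```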
5
votes
2 answers
What are the common pitfalls that we could face when training neural networks?
Apart from the vanishing or exploding gradient problems, what other problems or pitfalls could we face when training neural networks?

pjoter
- 51
- 1
5
votes
1 answer
What effect does batch norm have on the gradient?
Batch norm is a technique that essentially standardizes the activations at each layer before passing them on to the next layer. Naturally, this will affect the gradient through the network. I have seen the equations that derive the…

information_interchange
- 319
- 1
- 9
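One way to see the effect empirically is to compare the gradient reaching the first layer of a deep sigmoid MLP with and without batch norm. This PyTorch sketch is a hypothetical setup of mine (the depth, width, and squared-output loss are arbitrary choices), not the derivation the question refers to:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def deep_mlp(depth: int, width: int, use_bn: bool) -> nn.Sequential:
    """Deep sigmoid MLP, optionally with batch norm before each activation."""
    layers = []
    for _ in range(depth):
        layers.append(nn.Linear(width, width))
        if use_bn:
            layers.append(nn.BatchNorm1d(width))
        layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)

def first_layer_grad_norm(model: nn.Sequential) -> float:
    x = torch.randn(64, 32)            # one random mini-batch
    model(x).pow(2).mean().backward()  # arbitrary scalar loss
    return model[0].weight.grad.norm().item()

plain  = deep_mlp(depth=20, width=32, use_bn=False)
normed = deep_mlp(depth=20, width=32, use_bn=True)

print("first-layer grad norm, no batch norm  :", first_layer_grad_norm(plain))
print("first-layer grad norm, with batch norm:", first_layer_grad_norm(normed))
```

With the normalization layers in place, the first-layer gradient is typically orders of magnitude larger, because each layer's pre-activations are kept away from the saturated tails of the sigmoid.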
5
votes
1 answer
How to detect vanishing gradients?
Can vanishing gradients be detected by the change in distribution (or lack thereof) of my convolution's kernel weights throughout the training epochs? And if so, how?
For example, if only 25% of my kernel's weights ever change throughout the epochs,…

Elegant Code
- 153
- 1
- 7
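A lack of movement in the kernel weights is an indirect symptom; logging per-layer gradient norms alongside the relative weight change is a more direct check. A generic PyTorch monitoring sketch (the helper names are mine, not from any answer):

```python
import torch

def grad_norms(model: torch.nn.Module) -> dict:
    """Per-parameter gradient L2 norms; call right after loss.backward()."""
    return {name: p.grad.norm().item()
            for name, p in model.named_parameters()
            if p.grad is not None}

def relative_weight_change(model: torch.nn.Module, snapshot: dict) -> dict:
    """How far each parameter has moved relative to a saved snapshot."""
    return {name: ((p.detach() - snapshot[name]).norm()
                   / (snapshot[name].norm() + 1e-12)).item()
            for name, p in model.named_parameters()}

# usage sketch inside a training loop:
#   snapshot = {n: p.detach().clone() for n, p in model.named_parameters()}
#   ... loss.backward() ...
#   print(grad_norms(model))                         # norms collapsing toward 0?
#   print(relative_weight_change(model, snapshot))   # layers that barely move?
```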
4
votes
0 answers
Why does sigmoid saturation prevent signal flow through the neuron?
As per these slides on page 35:
Sigmoids saturate and kill gradients.
when the neuron's activation saturates at either tail of 0 or 1, the gradient at these regions is almost zero.
…it will effectively "kill" the gradient and almost no signal will flow through the neuron…

EEAH
- 193
- 1
- 5
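The numbers behind the quoted claim are easy to reproduce; a small numpy check of the sigmoid's local derivative at increasingly saturated inputs (my own illustration, not from the slides):

```python
import numpy as np

def dsigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# During backprop, the upstream gradient is multiplied by this local
# derivative; for a saturated sigmoid it is essentially zero.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"sigma'({x:>4}) = {dsigmoid(x):.2e}")
# sigma'( 0.0) = 2.50e-01
# sigma'( 2.0) = 1.05e-01
# sigma'( 5.0) = 6.65e-03
# sigma'(10.0) = 4.54e-05
```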
3
votes
1 answer
Why aren't artificial derivatives used more often to solve the vanishing gradient problem?
While looking into the vanishing gradient problem, I came across a paper (https://ieeexplore.ieee.org/abstract/document/9336631) that used artificial derivatives in lieu of the real derivatives. For a visualization, see the attached image:
As you…

postnubilaphoebus
- 345
- 1
- 11
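The linked paper's construction is not reproduced here, but the general idea of swapping in an artificial derivative can be sketched with a custom backward pass. Everything below (the class name, the clamp-at-0.1 surrogate) is a hypothetical illustration of the technique, not the paper's method:

```python
import torch

class SurrogateSigmoid(torch.autograd.Function):
    """Sigmoid forward pass, but the backward pass uses an artificial
    derivative bounded away from zero (a generic illustration of the idea,
    not the construction from the linked paper)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        s = torch.sigmoid(x)
        true_grad = s * (1 - s)
        # Hypothetical surrogate: never let the local derivative drop below 0.1.
        return grad_output * torch.clamp(true_grad, min=0.1)

x = (torch.randn(4) * 10).requires_grad_()   # strongly saturated inputs
SurrogateSigmoid.apply(x).sum().backward()
print(x.grad)   # stays >= 0.1 elementwise instead of collapsing toward 0
```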
3
votes
0 answers
Would a different learning rate for every neuron and layer mitigate or solve the vanishing gradient problem?
I'm interested in using the sigmoid (or tanh) activation function instead of ReLU. I'm aware of ReLU's advantages: faster computation and no vanishing gradient problem. But regarding the vanishing gradient, the main problem is about the backpropagation…

Rogelio Triviño
- 141
- 3
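A simpler relative of the idea, per-layer (rather than per-neuron) learning rates, is easy to try with optimizer parameter groups. The 10x/3x/1x factors below are arbitrary illustrative choices, not tuned values:

```python
import torch
import torch.nn as nn

# Layer-wise learning rates via parameter groups: give layers that receive
# smaller gradients a larger step.
model = nn.Sequential(
    nn.Linear(32, 32), nn.Sigmoid(),
    nn.Linear(32, 32), nn.Sigmoid(),
    nn.Linear(32, 1),
)

base_lr = 1e-3
optimizer = torch.optim.SGD([
    {"params": model[0].parameters(), "lr": base_lr * 10},  # earliest layer
    {"params": model[2].parameters(), "lr": base_lr * 3},
    {"params": model[4].parameters(), "lr": base_lr},       # output layer
], lr=base_lr)
```

Per-parameter adaptive optimizers such as RMSprop and Adam effectively perform this rescaling automatically, though rescaling the step cannot recover a gradient that has already underflowed to numerical zero.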
2
votes
1 answer
How does the vanishing gradient prevent RNNs from learning long-range dependencies?
I am really trying to understand deep learning models like RNNs, LSTMs, etc. I have gone through many RNN tutorials and have learned that RNNs cannot handle long-range dependencies, like:
Consider trying to predict the last word in the text “I…

Nafees Ahmed
- 41
- 3
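The usual formalization of that intuition (as in the Pascanu et al. 2013 analysis referenced elsewhere on this page) bounds the Jacobian of a hidden state with respect to a distant earlier one:

```latex
% Backpropagation through time for a vanilla RNN with
% h_t = \tanh(W h_{t-1} + U x_t), pre-activation a_t:
\[
  \frac{\partial h_T}{\partial h_t}
    = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}
    = \prod_{k=t+1}^{T} \operatorname{diag}\!\big(\tanh'(a_k)\big)\, W,
  \qquad
  \left\lVert \frac{\partial h_T}{\partial h_t} \right\rVert
    \le \big(\gamma\,\lVert W \rVert\big)^{\,T-t},
\]
% where \gamma bounds |\tanh'| (here \gamma = 1). If \gamma \lVert W \rVert < 1
% this bound decays exponentially in the gap T - t, so the loss at step T
% contributes almost nothing to the updates driven by distant inputs.
```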
2
votes
0 answers
How to decide if gradients are vanishing?
I am trying to debug a convolutional neural network. I am seeing gradients close to zero.
How can I decide whether these gradients are vanishing or not? Is there some threshold for judging vanishing gradients by looking at the values?
I am getting…

pramesh
- 121
- 4
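There is no universal absolute threshold; a scale-free heuristic is to track how the earliest layer's gradient norm compares to the last layer's over training. A hypothetical PyTorch helper (it assumes model.parameters() iterates from input to output, as it does for nn.Sequential-style models):

```python
import torch

def grad_norm_ratio(model: torch.nn.Module) -> float:
    """Ratio of the earliest trainable parameter's gradient norm to the last
    one's; call right after loss.backward(). A ratio that keeps shrinking over
    training (e.g. well below 1e-3 and still falling) is a stronger hint of
    vanishing gradients than any fixed threshold on the raw values."""
    norms = [p.grad.norm().item()
             for p in model.parameters() if p.grad is not None]
    return norms[0] / (norms[-1] + 1e-12)
```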
2
votes
2 answers
How do LSTM and GRU overcome the vanishing gradient problem?
I'm watching the video Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorflow Tutorial | Edureka, where the author says that the LSTM and GRU architectures help to reduce the vanishing gradient problem. How do LSTM and GRU…

DRV
- 1,573
- 2
- 11
- 18
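The short version of the standard answer, in equations (a sketch, not a full derivation): the LSTM cell state is updated additively, and gradient flowing along it is gated elementwise rather than squashed through a weight matrix at every step.

```latex
% LSTM cell-state update and the direct gradient path through it:
\[
  c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,
  \qquad
  \left.\frac{\partial c_t}{\partial c_{t-1}}\right|_{\text{direct path}}
    = \operatorname{diag}(f_t),
\]
% (plus smaller terms through the gates, which also depend on h_{t-1}).
% When the forget gate f_t is close to 1 this path is nearly the identity,
% so the error signal can survive across many time steps. GRUs obtain a
% similar additive path via h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t.
```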
1
vote
1 answer
Mathematically speaking, is it only the product operation used in the chain rule that causes the vanishing or exploding gradient?
I am asking this question from the mathematical perspective of the vanishing and exploding gradient problems that we generally face when training deep neural networks.
The chain rule of differentiation for a composite function can be expressed…

hanugm
- 3,571
- 3
- 18
- 50
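For reference, the product the question is asking about, together with the norm bound that makes both regimes visible (standard material, not taken from the answers):

```latex
% For a depth-L composition h_L = f_L(f_{L-1}(\dots f_1(x)\dots)),
% with layer Jacobians J_l = \partial h_l / \partial h_{l-1}:
\[
  \frac{\partial \mathcal{L}}{\partial h_1}
    = \frac{\partial \mathcal{L}}{\partial h_L}\,
      \prod_{l=2}^{L} J_l,
  \qquad
  \left\lVert \prod_{l=2}^{L} J_l \right\rVert
    \le \prod_{l=2}^{L} \lVert J_l \rVert .
\]
% If every \lVert J_l \rVert \le c < 1 the product can shrink like c^{L-1};
% if the norms exceed 1 it can grow the same way. The repeated product drives
% both phenomena, while the per-layer factors (weights and activation
% derivatives) set the rate.
```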
1
vote
0 answers
How do I infer exploding or vanishing gradients in Keras?
It may already be obvious that I am just a practitioner and a beginner in deep learning. I am still figuring out lots of the "why"s and "how"s of DL.
So, for example, if I train a feed-forward neural network, or an image classifier with CNNs, or…

Naveen Reddy Marthala
- 205
- 2
- 10
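Keras does not log gradients by default, but a tf.GradientTape pass over a batch gives per-variable norms that can be inspected periodically. A generic diagnostic sketch (the function name and the example loss are my own choices):

```python
import tensorflow as tf

def layer_gradient_norms(model: tf.keras.Model, x, y, loss_fn) -> dict:
    """Per-variable gradient norms for one batch; run every few steps
    alongside (or instead of) model.fit as a diagnostic."""
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    return {var.name: float(tf.norm(grad))
            for var, grad in zip(model.trainable_variables, grads)
            if grad is not None}

# usage sketch (loss function chosen as an example):
#   norms = layer_gradient_norms(model, x_batch, y_batch,
#                                tf.keras.losses.SparseCategoricalCrossentropy())
# Watch the earliest layers: norms steadily shrinking toward 0 point to
# vanishing gradients; norms growing by orders of magnitude (or a NaN loss)
# point to exploding gradients.
```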
1
vote
1 answer
Does the paper "On the difficulty of training Recurrent Neural Networks" (2013) assume, falsely, that spectral radii are $\ge$ square matrix norms?
(link to paper in arxiv)
In section 2.1 the authors define $\gamma$ as the maximum possible value of the derivative of the activation function (e.g., 1 for tanh). Then they have this to say:
We first prove that it is sufficient for $\lambda_1 <…

Jeremiah England
- 161
- 4
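For reference, the standard inequality the title is implicitly checking the paper against (stated here without judging the paper's argument):

```latex
% For any induced (or, more generally, sub-multiplicative) matrix norm:
\[
  \rho(W) \;=\; \max_i |\lambda_i(W)| \;\le\; \lVert W \rVert,
\]
% i.e. the spectral radius is a lower bound on the norm, not an upper bound.
% Equality holds for the 2-norm when W is normal (e.g. symmetric).
```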
1
vote
0 answers
Which activation functions can lead to the vanishing gradient problem?
According to this video tutorial, Vanishing Gradient Tutorial, the sigmoid function and the hyperbolic tangent can produce the vanishing gradient problem.
What other activation functions can lead to the vanishing gradient problem?

DRV
- 1,573
- 2
- 11
- 18