I'm watching the video Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorflow Tutorial | Edureka, where the author says that the LSTM and GRU architectures help reduce the vanishing gradient problem. How do LSTM and GRU prevent the vanishing gradient problem?
2 Answers
LSTMs mitigate the problem through an additive gradient structure: the cell state is updated by addition rather than by repeated multiplication, and the gradient flowing along it is modulated directly by the forget gate's activations. Because the gates are recomputed at every time step, the network can learn to keep this gradient path close to 1 and so preserve the error signal over long sequences.
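To make this concrete, here is a rough sketch of the gradient path through the cell state, written in standard LSTM notation (this derivation is my own summary, not taken from the video, and it treats the gates' dependence on the previous cell state as secondary):

```latex
% Cell-state update: additive, gated by the forget gate f_t
\[
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
\]
% Gradient of the cell state with respect to the previous cell state
% (ignoring the gates' indirect dependence on c_{t-1}):
\[
\frac{\partial c_t}{\partial c_{t-1}} \approx f_t
\]
% Over k steps the gradient is a product of forget-gate activations,
% which the network can learn to keep close to 1:
\[
\frac{\partial c_t}{\partial c_{t-k}} \approx \prod_{j=t-k+1}^{t} f_j
\]
```

Compare this with a vanilla RNN, where the corresponding factor is a product of weight matrices and tanh derivatives, which tends to shrink toward zero as the number of steps grows.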

Saurav Maheshkar
An LSTM carries the previous cell state forward to the current time step through a largely additive update, instead of repeatedly squashing it through matrix multiplications and nonlinearities the way a vanilla RNN does. This simple yet effective design helps reduce the vanishing gradient, because each state retains information about all of the previous states. Think of it like trading: if you still have all the numbers from a year ago, you can make better decisions! A minimal code sketch of one LSTM step is below.
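Here is a minimal sketch of a single LSTM step in NumPy (not from the linked article; the parameter names and sizes are made up for illustration), showing how the previous cell state is carried forward through a mostly additive update:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked gate parameters
    (input, forget, output, candidate), each of hidden size H."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # pre-activations for all four gates
    i = sigmoid(z[0:H])                   # input gate
    f = sigmoid(z[H:2 * H])               # forget gate
    o = sigmoid(z[2 * H:3 * H])           # output gate
    c_tilde = np.tanh(z[3 * H:4 * H])     # candidate cell state
    # Additive update: c_prev is scaled only by the forget gate, so the
    # gradient path back to earlier steps is a product of f's, which the
    # network can learn to keep near 1 instead of shrinking to 0.
    c_t = f * c_prev + i * c_tilde
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Toy usage with made-up sizes
rng = np.random.default_rng(0)
X_DIM, H_DIM = 4, 3
W = rng.normal(size=(4 * H_DIM, X_DIM))
U = rng.normal(size=(4 * H_DIM, H_DIM))
b = np.zeros(4 * H_DIM)
h, c = np.zeros(H_DIM), np.zeros(H_DIM)
h, c = lstm_step(rng.normal(size=X_DIM), h, c, W, U, b)
```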
I highly recommend this article, which explains the concept very well.

Minh-Long Luu
- Can you clarify which parts of that article are useful to understanding how LSTMs avoid the vanishing gradient problem? I've searched it for "vanish", "gradient" and "derivative" and found nothing. – Sycorax Mar 18 '23 at 18:33
- "Long-term dependencies" is the keyword; it is closely tied to the vanishing gradient. Learning those dependencies requires backpropagating through many time steps, which is where the vanishing appears. – Minh-Long Luu Mar 19 '23 at 01:27
- Perhaps you could [edit] your answer to explain in detail how the LSTM mechanism overcomes the vanishing gradient problem. As it stands, the connection between the two is not entirely clear. – Sycorax Mar 30 '23 at 14:58