There are two sources that I'm using to try to understand why LSTMs reduce the likelihood of the vanishing gradient problem associated with RNNs.
Both of these sources say that LSTMs are able to reduce the likelihood of the vanishing gradient problem because:
1. The gradient contains the forget gate's vector of activations.
2. The addition of four gradient values helps balance the gradient values.
I understand (1), but I don't understand what (2) means.
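For reference, here is my (possibly mistaken) reading of what the "four gradient values" are, using the notation from the cs231n slides. The cell state is updated as

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t,$$

where the gates $f_t$, $i_t$ and the candidate $g_t$ all depend on $h_{t-1}$, which in turn depends on $c_{t-1}$. Differentiating elementwise with the product rule then gives four additive terms:

$$\frac{\partial c_t}{\partial c_{t-1}} = f_t + \frac{\partial f_t}{\partial c_{t-1}} \odot c_{t-1} + \frac{\partial i_t}{\partial c_{t-1}} \odot g_t + i_t \odot \frac{\partial g_t}{\partial c_{t-1}}.$$

I assume these are the four values being added, but I don't see in what sense summing them "balances" the gradient.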
Sources:

- Slide 119 of http://cs231n.stanford.edu/slides/2020/lecture_10.pdf
- The sentence beginning "Another important property to notice is that the cell state" in https://medium.com/datadriveninvestor/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients-a6784971a577
Any insight would be greatly appreciated!