
When training a large neural network, how do you deal with the case where the gradients are too small to have any impact?

FYI, I have an RNN with multiple LSTM cells, and each cell has hundreds of neurons. Each training example has thousands of steps, so the RNN unrolls thousands of times. When I print out all the gradients, they are very small, around 1e-20 of the variable values, so training does not change the variable values at all.
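For context, here is roughly how I print the gradients (a minimal sketch in TensorFlow 1.x; `loss` and `feed` stand in for my actual loss tensor and feed dict):

    import tensorflow as tf

    # Sketch: compare the largest gradient entry to the largest variable
    # entry for every trainable variable. `loss` and `feed` are placeholders
    # for the real loss tensor and feed dict from the training script.
    variables = tf.trainable_variables()
    grads = tf.gradients(loss, variables)
    ratios = [tf.reduce_max(tf.abs(g)) / (tf.reduce_max(tf.abs(v)) + 1e-12)
              for g, v in zip(grads, variables)]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(ratios, feed_dict=feed))  # values come out around 1e-20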

BTW, I think this is not an issue of vanishing gradients: the gradients are uniformly small from beginning to end, rather than decaying toward the earlier time steps.

Any suggestions for overcoming this issue?

Thanks!

Tom Z
  • Welcome to ai.se... great question. A low gradient means your error is also low; why do you need higher accuracy? You could also increase the floating-point precision if you want more accuracy, or maybe increase the learning rate. –  Apr 06 '18 at 06:55
  • Did you try Backpropagation Through Time (BPTT)? We need more information about your net and task to help you. – user3352632 Apr 09 '18 at 16:28
  • Thanks for your comment and suggestion. I changed the layer from tf.contrib.rnn.LSTMBlockCell to tf.contrib.rnn.LayerNormBasicLSTMCell. Then the gradients became large enough to influence the network. – Tom Z Apr 11 '18 at 05:29

2 Answers


Vanishing gradients are a common problem in RNNs.

A common way to deal with them is gradient clipping: you define a maximum and/or a minimum threshold for the gradients (or clip their global norm). See here for more information.

Further information and a piece of code to implement it can be found on SO here.
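For illustration, a minimal sketch of norm-based clipping in TensorFlow 1.x; the optimizer, the 5.0 threshold, and the `loss` tensor are just placeholders for your own:

    import tensorflow as tf

    # Clip the global norm of all gradients before applying them.
    # The Adam optimizer, learning rate, and clip_norm are illustrative.
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
    grads, variables = zip(*optimizer.compute_gradients(loss))
    clipped, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
    train_op = optimizer.apply_gradients(zip(clipped, variables))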

Hope it helps!

nsaura

I changed the layer from tf.contrib.rnn.LSTMBlockCell to tf.contrib.rnn.LayerNormBasicLSTMCell. Then the gradients became large enough to influence the network.
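For those who don't know TensorFlow, a sketch of the swap (the cell size and input shape are illustrative, not my actual values):

    import tensorflow as tf

    inputs = tf.placeholder(tf.float32, [None, None, 128])  # batch, time, features

    # Before: plain LSTM cell; gradients stayed around 1e-20.
    # cell = tf.contrib.rnn.LSTMBlockCell(num_units=256)

    # After: the same cell with layer normalization applied to its internals.
    cell = tf.contrib.rnn.LayerNormBasicLSTMCell(num_units=256)
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)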

Tom Z
  • You need to add details about what exactly that means, for those who don't know TensorFlow. –  Apr 11 '18 at 05:50
  • It means that I normalize the inputs to the network layers. I hope this explains it. – Tom Z Apr 13 '18 at 17:45