
I am writing a Recurrent Neural Network using only the NumPy library for a binary classification problem. When I initialize the weights with np.random.randn, it reaches ~60% accuracy after 1000 epochs, whereas when I divide the weights by 1000 first, it reaches 100% accuracy after the same number of epochs.

Why is this? Do RNNs work better with smaller weights or does the number 1000 mean something?

Any and all help is welcome, thanks.
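
For concreteness, here is a minimal sketch of the two initializations I am comparing (the layer sizes are just placeholders, not my actual network):

```python
import numpy as np

np.random.seed(0)
n_in, n_hidden = 16, 32  # placeholder sizes, not the actual network

# Initialization 1: plain standard-normal weights (std ~ 1) -> ~60% accuracy
Wxh = np.random.randn(n_hidden, n_in)
Whh = np.random.randn(n_hidden, n_hidden)

# Initialization 2: the same draw divided by 1000 (std ~ 0.001) -> 100% accuracy
Wxh_small = Wxh / 1000
Whh_small = Whh / 1000
```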

1 Answer

There is no magic value that works for every network, but in general:

  • too large initial weights lead to exploding gradients (the loss diverges and training never converges)
  • too small initial weights lead to vanishing gradients (the updates shrink toward zero, so the loss barely moves and the network effectively stops learning)
  • the best initialization strategies use uniformly or normally distributed weights, with an order of magnitude that depends on the number of parameters in the previous layer (see Xavier initialization and Kaiming initialization; a minimal sketch follows this list)
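
For illustration, here is a minimal NumPy sketch of the scales involved; the layer sizes are placeholders chosen only for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 16, 32  # placeholder layer sizes for the example

# Naive standard-normal init: std ~ 1, prone to exploding gradients in an RNN
W_naive = rng.standard_normal((fan_out, fan_in))

# Shrinking by a large constant: std ~ 0.001, can instead make gradients vanish
W_tiny = W_naive / 1000

# Xavier/Glorot (uniform): variance 2 / (fan_in + fan_out),
# keeps activation variance roughly constant with tanh/sigmoid units
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_xavier = rng.uniform(-limit, limit, size=(fan_out, fan_in))

# Kaiming/He (normal): variance 2 / fan_in, the usual choice with ReLU units
W_kaiming = rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

print(W_naive.std(), W_tiny.std(), W_xavier.std(), W_kaiming.std())
```

Printed side by side, the Xavier and Kaiming scales land between the two extremes from the question, which is why they are usually a safer default than either plain randn or an arbitrary constant like 1/1000.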

Check this blog post for pretty good animations of both problems.

Edoardo Guerriero