
PyTorch's orthogonal initialization cites "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", Saxe et al. (2013), which attributes the usefulness of orthogonal initialization to the fact that, for a deep linear network, the product of all the weight matrices is an isometry, i.e. all of its singular values are (close to) 1.

However, I can see empirically that if I use this initialization for hidden layers that have more neurons than the input and output layers, the isometry goes away. For example:

import torch, numpy

def g(r, c):
    # orthogonal_ fills w with a (semi-)orthogonal matrix of shape (r, c)
    w = torch.empty(r, c)
    return torch.nn.init.orthogonal_(w).numpy()

# singular values of the product of a 5x7 and a 7x5 initialized layer
numpy.linalg.svd(g(5, 7) @ g(7, 5))[1]
# Out[1]:
# array([1.0000001 , 1.        , 0.99999994, 0.7571472 , 0.05615549],
#       dtype=float32)
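
For comparison, a square version of the same check (width 5, chosen arbitrarily) should give singular values that are all 1 up to floating-point error, since the product of square orthogonal matrices is itself orthogonal:

# reusing g() from above; both factors are square, so the product is orthogonal
numpy.linalg.svd(g(5, 5) @ g(5, 5))[1]
# all five singular values should come out as ~1.0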

My question: is orthogonal initialization still useful when hidden layer sizes vary? Should I be doing anything extra to handle such size variations?

