I wonder whether there are heuristic rules for the optimal selection of learning rates for different layers. I expect that there is no general recipe, but there are probably some choices that tend to be beneficial.
The common strategy is to use the same learning rate for all layers - say, the Adam optimizer with lr=1e-4 - and this choice performs well in many situations.
However, it seems that the weights in different layers may converge to their optimal values at different speeds. For example, the weights in the first few layers may be close to the optimum after a few epochs, whereas features in deeper layers typically require many more epochs to reach a good value.
Are there any rules for choosing a smaller (or larger) learning rate in the top layers of the network compared with the bottom layers?
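To make the mechanism concrete, here is a minimal PyTorch sketch of per-layer learning rates via optimizer parameter groups; the toy model, the base rate, and the decay factor of 0.5 are just illustrative assumptions (and the decay direction could equally be reversed), not a recommended rule.

```python
# A minimal sketch of per-layer learning rates via PyTorch parameter groups.
# The model, the base rate, and the decay factor are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

base_lr = 1e-4
decay = 0.5  # hypothetical: each later layer gets half the rate of the previous one

# Build one parameter group per Linear layer with a geometrically decayed rate.
linear_layers = [m for m in model if isinstance(m, nn.Linear)]
param_groups = [
    {"params": layer.parameters(), "lr": base_lr * decay**i}
    for i, layer in enumerate(linear_layers)
]

optimizer = torch.optim.Adam(param_groups)
print([group["lr"] for group in optimizer.param_groups])  # [1e-4, 5e-5, 2.5e-5]
```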
Also, neural networks can contain different types of layers - convolutional, dense, recurrent, self-attention - and some of them may converge faster or slower than others.
Has this question been studied in the literature?
Different learning rates for different layers already appear in transfer learning - it is common to fine-tune only the last few layers and keep the others frozen or update them with a smaller learning rate. The intuition is that the earlier layers extract generic features that transfer across tasks, and it is desirable not to spoil them during fine-tuning.
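For context, this is roughly what that setup looks like in PyTorch; torchvision's pretrained resnet18 and the specific rates here are only assumptions for illustration.

```python
# A sketch of the transfer-learning setup described above, assuming PyTorch and
# torchvision's pretrained resnet18 purely as a convenient example.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)  # new task-specific head

head_params = list(model.fc.parameters())
head_ids = {id(p) for p in head_params}
backbone_params = [p for p in model.parameters() if id(p) not in head_ids]

# Pretrained layers get a much smaller rate than the freshly initialized head;
# setting requires_grad_(False) on the backbone would freeze it entirely.
optimizer = torch.optim.Adam([
    {"params": backbone_params, "lr": 1e-5},
    {"params": head_params, "lr": 1e-3},
])
```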
However, my question is about training from scratch.