
I wonder whether there are heuristic rules for the optimal selection of learning rates for different layers. I expect that there is no general recipe, but there are probably some choices that tend to be beneficial.

The common strategy is to use the same learning rate for all layers. Say, take the Adam optimizer with lr=1e-4; this choice performs well in many situations.

However, it seems that the weights in different layers may converge to their optimal values at different speeds. Say, the values in the first few layers are close to the optimum after a few epochs, whereas features in deeper layers typically require many more epochs to reach a good value.

Are there any rules for choosing a smaller (or larger) learning rate in the top layers of the network compared with the bottom layers?
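
For concreteness, here is a minimal PyTorch sketch of the two setups I have in mind: the usual uniform learning rate versus giving each block its own rate via parameter groups. The toy model and the specific values are placeholders, not a recommendation:

```python
import torch
import torch.nn as nn

# Toy model: an "early" block and a "late" block (placeholder architecture).
model = nn.Sequential(
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),  # early layers
    nn.Sequential(nn.Linear(64, 10)),             # late layers
)

# Common strategy: a single learning rate for all parameters.
uniform_opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Per-layer alternative: parameter groups, each with its own learning rate.
# The particular values (1e-4 vs 1e-3) are arbitrary and only show the mechanism.
per_layer_opt = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": 1e-4},  # early block
    {"params": model[1].parameters(), "lr": 1e-3},  # late block
])
```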

Also, neural networks can have different types of layers - convolutional, dense, recurrent, self-attention - and some of them may converge faster than others.

Has this question been studied in the literature?

Different learning rates for different layers already emerge in transfer learning - it is common to tune only the last few layers and keep the others frozen, or to let them evolve with a smaller learning rate. The intuition behind this is that the earlier layers extract generic features that are universal across tasks, and it is desirable not to spoil them during fine-tuning.
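
As a sketch, this is roughly how that transfer-learning setup looks in PyTorch (assuming a pretrained torchvision ResNet-18 as the backbone and a recent torchvision version; the learning-rate values are placeholders):

```python
import torch
from torchvision import models

# Pretrained backbone from torchvision (an assumption; any pretrained net works the same way).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Variant 1: freeze everything except the final classification layer.
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
head_only_opt = torch.optim.Adam(model.fc.parameters(), lr=1e-4)

# Variant 2: train everything, but give the pretrained layers a much smaller learning rate.
for param in model.parameters():
    param.requires_grad = True  # undo the freezing above; pick one variant in practice
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
discriminative_opt = torch.optim.Adam([
    {"params": backbone_params, "lr": 1e-5},        # pretrained layers: small lr (placeholder)
    {"params": model.fc.parameters(), "lr": 1e-4},  # new head: larger lr (placeholder)
])
```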

However, my question is about training from scratch.

  • Consider the simpler case where you have a single learning rate. Are there any rules for choosing the learning rate? As far as I know this can only be answered by hyperparameter optimization (or, if you have done hyperparameter optimization on a similar problem, use the previously found learning rate). Now clearly this situation is a special case of your question. It then seems to me that hyperparameter optimization is the only solution. – Taw Aug 19 '21 at 17:14
  • @Taw - yes, even for a single learning rate the optimal choice is a priori unknown. However, there are some reasonable guesses to start with - lr=1e-4 with Adam is optimal for many CNNs trained on ImageNet, but transformers usually need smaller learning rates for stable training, like 1e-5 with a Noam scheduler. And small models, of size ~1k-10k parameters, work best with lr=1e-3 for Adam in my practice. – spiridon_the_sun_rotator Aug 19 '21 at 19:02
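
For reference, the Noam schedule mentioned in the comment above is usually written as lr = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5)). A minimal PyTorch sketch of that schedule (with placeholder d_model and warmup values, and a dummy parameter standing in for a real model) could look like:

```python
import torch

# Noam-style warmup schedule via LambdaLR; d_model and warmup_steps are placeholders.
d_model, warmup_steps = 512, 4000

def noam_factor(step: int) -> float:
    step = max(step, 1)  # avoid 0 ** -0.5 on the first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = torch.optim.Adam(params, lr=1.0)   # base lr of 1.0; the lambda supplies the scale
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_factor)

# In a training loop: call optimizer.step() and then scheduler.step() each iteration.
```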

0 Answers