
In transfer learning for computer vision, I've seen that the layers of the base model are frozen if the new images aren't too different from the data the base model was trained on.

However, on the NLP side, I see that the layers of the BERT model aren't ever frozen. What is the reason for this?
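For concreteness, "freezing" here just means disabling gradient updates for the pretrained weights. Below is a minimal sketch with PyTorch and Hugging Face Transformers (the model name and the `.bert` attribute follow that library's BERT classes; the setup is purely illustrative):

```python
# Minimal sketch (PyTorch + Hugging Face Transformers); the model name and
# attribute layout (.bert, .classifier) follow the library's BERT classes.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Computer-vision-style transfer learning: freeze the pretrained base model
# so that only the newly added classification head is trained.
for param in model.bert.parameters():
    param.requires_grad = False

# The common NLP recipe instead leaves every parameter trainable (no freezing),
# i.e. the loop above is simply omitted.
```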

Bunny Rabbit
  • Maybe you should provide links to 2-3 examples (of implementations or papers/models) where the layers are not frozen (just to give more context). In any case, freezing the layers may not necessarily be required if you use a very low learning rate (but I am not an expert on this topic). – nbro Oct 03 '20 at 14:38
  • I wouldn't say they are never frozen, just rarely. Generally, you fine-tune the entire model if you care more about accuracy (or whatever metric), while you freeze layers and fine-tune only the later layers if you care more about fine-tuning/prediction speed and memory usage. For example, fine-tuning the entire BERT on a GPU can take about 8 GB of VRAM in my experience, which can be reduced by a GB or two by freezing layers. – primussucks Oct 05 '20 at 17:20
  • @primussucks If we don't freeze BERT, won't the new layers mess up the trained weights? – dato nefaridze May 16 '21 at 14:01
  • 2
    @datonefaridze It will change the weights yes, but "mess up" suggests that the model will perform poorly if they aren't frozen, which just isn't true. – primussucks May 18 '21 at 07:53

1 Answer


Corrections and other answers are welcome, but here are a few thoughts:

There are several approaches to which weights get frozen (and other considerations as well; see, for example, Fig. 5 in "Galactica: A Large Language Model for Science").

Which of the approaches yields higher-quality results depends on the architecture (and hyperparameters) and dataset.

There are rules of thumb; for example, this old snapshot of the Hugging Face documentation on Transformer models said:

we are directly fine-tuning the whole model without taking any precaution. It actually works better this way for Transformers model

but this explanation was apparently removed from the newer version of that page. Maybe it turned out that such rules of thumb don't hold in general.
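As a concrete illustration of that "fine-tune the whole model" recipe, here is a minimal sketch (the optimizer and learning rate are my own illustrative assumptions, not taken from the quoted page):

```python
# Minimal sketch of the "fine-tune the whole model" recipe; the learning rate
# and optimizer are illustrative assumptions, not taken from the quoted page.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Nothing is frozen: every pretrained weight receives gradient updates.
# A small learning rate keeps the updates to the pretrained weights gentle.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# ... then a standard training loop: forward pass, loss.backward(), optimizer.step()
```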

Quality of results is also not the only thing being optimized: some choices are made for memory or compute reasons. For example, when the first layers are frozen, their output features can be computed once for all samples, saved, and reused thereafter; moreover, the gradient of the loss with respect to the weights of the frozen blocks never needs to be computed.
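To illustrate, here is a minimal sketch of freezing the embeddings and the first k encoder layers of a BERT model (the attribute names follow the Hugging Face BERT implementation; the model name and the value of k are arbitrary):

```python
# Minimal sketch: freeze the embeddings and the first k encoder layers of BERT.
# Attribute names (.embeddings, .encoder.layer) follow the Hugging Face BERT
# implementation; the model name and k are arbitrary choices for illustration.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
k = 8  # number of lower encoder layers to freeze

for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:k]:
    for param in layer.parameters():
        param.requires_grad = False

# Since these layers no longer change, their outputs for a fixed dataset are
# constant: they could be computed once under torch.no_grad(), cached, and
# reused, and no gradients need to be propagated through them.
```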

root