Corrections and other answers are welcome, but here are a few thoughts:
There are several approaches that differ in which weights get frozen (and in other considerations as well; see, for example, Fig. 5 in "Galactica: A Large Language Model for Science"). Which approach yields higher-quality results depends on the architecture, the hyperparameters, and the dataset; a sketch of one common variant follows below.
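As a minimal illustration (not tied to any particular paper), here is a hedged PyTorch sketch of one such choice: freezing the embeddings and the first few encoder layers of a BERT classifier while fine-tuning the rest. The checkpoint name, the number of frozen layers, and the parameter-name prefixes are assumptions for the example; prefixes differ between architectures.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Assumed setup: a BERT classifier where the embeddings and the first
# 4 encoder layers are frozen, and only the remaining weights are fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

frozen_prefixes = ["bert.embeddings."] + [
    f"bert.encoder.layer.{i}." for i in range(4)
]

for name, param in model.named_parameters():
    if any(name.startswith(prefix) for prefix in frozen_prefixes):
        param.requires_grad = False  # excluded from gradient updates

# Only the still-trainable parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```

Setting all parameters to `requires_grad = True` (the default) instead gives the "fine-tune the whole model" variant mentioned in the quote below.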
There can be rules of thumb; for example, this old snapshot of the Hugging Face documentation on Transformer architectures said:

> we are directly fine-tuning the whole model without taking any precaution. It actually works better this way for Transformers model
but this explanation was apparently removed from the newer version of that page. Perhaps it turned out that such rules of thumb do not hold in general.
The quality of results is also not the only thing being optimized; some choices are made for memory or compute reasons. For example, when the first layers are frozen, their output features can be computed once for all samples, cached, and reused thereafter; moreover, the gradients of the loss with respect to the weights of the frozen block never need to be computed. A sketch of this caching trick is below.
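Here is a toy, self-contained PyTorch sketch of that idea. The `frozen_block`, `head`, and random data are placeholders standing in for the frozen early layers, the trainable part, and a real dataset; the point is only that the frozen features are computed a single time and reused every epoch.

```python
import torch
import torch.nn as nn

# Placeholder model split into a frozen "first block" and a trainable head.
frozen_block = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
head = nn.Linear(64, 10)

for param in frozen_block.parameters():
    param.requires_grad = False  # no gradients flow into the frozen block

inputs = torch.randn(1000, 128)          # stand-in for the training set
labels = torch.randint(0, 10, (1000,))   # stand-in for the labels

# Compute the frozen features once and cache them; the frozen block never
# has to be evaluated (or back-propagated through) again.
with torch.no_grad():
    cached_features = frozen_block(inputs)

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    logits = head(cached_features)   # reuse the cached features each epoch
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()                  # gradients are computed only for the head
    optimizer.step()
```

In a real pipeline the cached features would typically be written to disk once and streamed back during training, which saves both forward-pass compute and activation memory for the frozen layers.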