I've had it in my head that, generally speaking, it's better to freeze layers when fine-tuning an LLM, per this quote from HuggingFace's article:
PEFT approaches only fine-tune a small number of (extra) model parameters while freezing most parameters of the pretrained LLMs, thereby greatly decreasing the computational and storage costs. This also overcomes the issues of catastrophic forgetting, a behaviour observed during the full finetuning of LLMs. PEFT approaches have also shown to be better than fine-tuning in the low-data regimes and generalize better to out-of-domain scenarios. It can be applied to various modalities, e.g., image classification and stable diffusion dreambooth.
I think what I might be confused by is what is meant by the "(extra)" part. It led me to try fine-tuning a BERT model in PyTorch by freezing all parameters except for the final feed-forward layer responsible for sequence classification (the classifier head):
# Freeze every parameter in the model...
for param in model.parameters():
    param.requires_grad = False
# ...then unfreeze only the classification head
for param in model.classifier.parameters():
    param.requires_grad = True
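A quick way to confirm that only the classifier head is still trainable after that loop (this is just standard PyTorch parameter counting on the same model object):

# Count how many parameters still have requires_grad=True after freezing
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")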
However, this made my model's evaluation metrics on my test set significantly worse than when I fine-tuned the whole model without freezing anything. This led me to the following conclusions:
My dataset of ~100K datapoints is not a "low-data regime" and therefore doesn't benefit from PEFT? But doesn't the quote say PEFT generalizes better to "out-of-domain scenarios"? How do I know whether the particular sequence classification I'm doing with BERT is out-of-domain? Because it isn't specifically a next-sentence prediction task?
Is this the cost of misinterpreting the "(extra)" model parameters part? I'm fine-tuning a small number of existing model parameters here, not extra ones added on top.
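If "(extra)" actually means newly added adapter parameters (e.g. LoRA) rather than a subset of the existing ones, then I'm guessing the intended setup looks more like the sketch below, using the peft library. The checkpoint name and LoRA hyperparameters here are just placeholder values I picked, not anything from the article:

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Placeholder checkpoint and label count -- swap in the real ones
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# LoRA adds small trainable adapter matrices ("extra" parameters)
# alongside the frozen pretrained weights
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)

# Only the adapters (plus the classification head) are trainable now
model.print_trainable_parameters()

Is that the distinction the quote is drawing, i.e. the "small number of (extra) parameters" are these new adapter weights, while every original weight stays frozen?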
I'm just confused here. The quote I've shown makes me believe my PEFT model should have outperformed regular fine-tuning.