
I've had it in my head that, generally speaking, it's better to freeze layers when fine-tuning an LLM, as per this quote from HuggingFace's article on PEFT:

PEFT approaches only fine-tune a small number of (extra) model parameters while freezing most parameters of the pretrained LLMs, thereby greatly decreasing the computational and storage costs. This also overcomes the issues of catastrophic forgetting, a behaviour observed during the full finetuning of LLMs. PEFT approaches have also shown to be better than fine-tuning in the low-data regimes and generalize better to out-of-domain scenarios. It can be applied to various modalities, e.g., image classification and stable diffusion dreambooth.

I think what I might be confused by is what is meant by the "(extra)" part. It led me to try fine-tuning a BERT model in PyTorch by freezing all parameters except for the final feed-forward layer responsible for sequence classification (the classifier head):

# Freeze every parameter in the pretrained model...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the classification head
for param in model.classifier.parameters():
    param.requires_grad = True

However, this caused my model to get significantly worse evaluation metrics on my test set than before I did this. This led me to the following conclusions:

  • My dataset of ~100K datapoints is not a "low-data regime" and therefore doesn't benefit from PEFT? But doesn't the quote say PEFT generalizes better to "out-of-domain scenarios"? How do I know whether the particular sequence classification I'm doing with BERT is out-of-domain? Because it isn't specifically a next-sentence prediction task?

  • Is this the cost of misinterpreting the "(extra)" model parameters part? I'm fine-tuning a small number of existing model parameters here, not extra ones (I try to make this concrete in the sketch below).

I'm just confused here. The quote I've shown above makes me believe my PEFT model should have outperformed regular fine-tuning.
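
To make the "(extra)" question concrete: I now suspect it means adding small, newly initialized modules on top of a frozen backbone and training only those, rather than unfreezing a slice of the existing weights. Here is a rough sketch of what I imagine that looks like (the BottleneckAdapter class, its sizes, and the names are purely my own illustration, not from the article):

import torch.nn as nn

# Hypothetical "extra" parameters: a small trainable module added on top
# of a frozen pretrained backbone.
class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, x):
        # Residual connection so the adapter starts out close to the identity
        return x + self.up(self.act(self.down(x)))

# backbone = ...                           # pretrained model, all requires_grad = False
# adapter = BottleneckAdapter()            # new ("extra") parameters, trainable
# classifier = nn.Linear(768, num_labels)  # also newly initialized and trainable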

1 Answer


I have seen researchers use both approaches. Typically, freezing is useful if you have very few examples (fewer than 100, in my experience). Otherwise, ask yourself:

  1. Do you have enough resources to fine-tune with an unfrozen model?
  2. Is performance better with an unfrozen or frozen model?

Many people have turned to LoRA to sidestep this trade-off: https://github.com/microsoft/LoRA
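
For instance, with the Hugging Face peft library you can wrap a sequence-classification model so that only the low-rank adapter matrices (plus the classification head) are trained. This is just a minimal sketch, and the exact arguments may differ between library versions:

from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Load a pretrained model with a fresh classification head
base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Inject low-rank adapters; the original BERT weights stay frozen
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters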

– Thomas K
  • Why would an unfrozen model get "messed up" more easily by a dataset with few examples? – multiheadedattention May 19 '23 at 17:40
  • What defines an "easier" dataset is unclear, but you need enough samples to represent the true distribution; otherwise your model will underfit. Simply put, small datasets are harder to generalize from, so unfreezing the weights might degrade performance. – Thomas K May 22 '23 at 18:04