
Recently, I came across the BERT model. I did some research and tried some implementations.

I wanted to tackle an NER task, so I chose the BertForSequenceClassification model provided by HuggingFace.
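
For reference, the training loop below assumes a setup roughly along these lines (the checkpoint name, label count, and learning rate here are placeholders, not my exact values):

import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, BertTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder checkpoint and label count -- substitute your own.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")  # used to build train_loader (not shown)
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=9)
model.to(device)

# The optimizer referenced in the loop; the learning rate is a placeholder.
optimizer = AdamW(model.parameters(), lr=2e-5)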

for epoch in range(1, args.epochs + 1):
    total_loss = 0
    model.train()
    for step, batch in enumerate(train_loader):
        # Unpack the batch and move each tensor to the target device.
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        # Clear any gradients accumulated from the previous step.
        model.zero_grad()

        # Forward pass; passing labels makes the model return the loss as the first output.
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs[0]

        total_loss += loss.item()
        # Backward pass, then clip gradients to avoid exploding gradients.
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update the parameters based on their gradients, the learning rate, etc.
        optimizer.step()

The main part of my fine-tuning loop is shown above.

I am curious to what extent fine-tuning alters the model. Does it freeze the weights provided by the pre-trained model and only train the top classification layer, or does it also change the hidden layers contained in the already pre-trained BERT model?
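
One check I considered (a sketch, using the model and loop above) is to snapshot a single encoder weight before training and compare it afterwards:

# Copy one encoder weight before training and compare it after fine-tuning.
before = model.bert.encoder.layer[0].attention.self.query.weight.detach().clone()
# ... run the training loop above ...
after = model.bert.encoder.layer[0].attention.self.query.weight.detach()
print(torch.equal(before, after))  # False would mean the hidden layers were updated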


1 Answer


Taken directly from the HuggingFace documentation:

Note that if you are used to freezing the body of your pretrained model (like in computer vision) the above may seem a bit strange, as we are directly fine-tuning the whole model without taking any precaution. It actually works better this way for Transformers model (so this is not an oversight on our side). If you’re not familiar with what “freezing the body” of the model means, forget you read this paragraph.
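
In other words, by default every parameter of BertForSequenceClassification is trainable: the embeddings, all encoder layers, and the newly initialised classification head have requires_grad=True and are all updated by optimizer.step(). A small sketch (assuming the model from the question) of how you could verify this, and how you would freeze the body if you really wanted to:

# By default nothing is frozen: every parameter is trainable.
print(all(p.requires_grad for p in model.parameters()))  # True

# To get the computer-vision-style behaviour of only training the head,
# you would have to freeze the BERT body explicitly (the quote above
# suggests this usually hurts performance for Transformer models):
for param in model.bert.parameters():
    param.requires_grad = False

# Only the classification head remains trainable after freezing.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['classifier.weight', 'classifier.bias']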
