
In the original transformer paper, the attention mechanism uses parameter matrices but no bias terms. However, more recent implementations often add a bias term when computing the "key", "query", and "value" projections. For example, in Andrej Karpathy's recent implementation of GPT, a config flag controls whether bias terms are used:

bias: bool = True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster

This makes me wonder whether there is any evidence that the bias terms help. In particular, if, according to Karpathy, not using bias is "a bit better and faster", why is he using them by default?
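To make the question concrete, here is a minimal sketch (assuming PyTorch; this is an illustrative stand-in, not Karpathy's exact code) of a nanoGPT-style causal self-attention block where the `bias` flag simply toggles the bias of the fused Q/K/V projection and the output projection:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Hypothetical minimal attention block; `bias` toggles Linear biases."""

    def __init__(self, n_embd: int, n_head: int, bias: bool = True):
        super().__init__()
        self.n_head = n_head
        # one fused projection producing Q, K, V, as in nanoGPT-style code
        self.c_attn = nn.Linear(n_embd, 3 * n_embd, bias=bias)
        self.c_proj = nn.Linear(n_embd, n_embd, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape each to (B, n_head, T, head_dim)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

# With n_embd=64, bias=True adds 3*64 + 64 = 256 parameters per block
with_bias = CausalSelfAttention(64, 4, bias=True)
no_bias = CausalSelfAttention(64, 4, bias=False)
extra = sum(p.numel() for p in with_bias.parameters()) - \
        sum(p.numel() for p in no_bias.parameters())
print(extra)  # 256
```

So the bias terms are a small fraction of the parameter count; the question is whether they buy anything in quality.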

nbro
Tarvoc

1 Answer


I guess that bias terms increase the expressive power of the model. See the paper "BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models", which shows that fine-tuning only the bias terms of a pretrained transformer can be competitive with full fine-tuning, suggesting the biases carry meaningful capacity.
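The BitFit idea can be sketched in a few lines (a hedged illustration, assuming PyTorch; the `nn.Sequential` model here is a stand-in, not the paper's architecture): freeze every parameter except the biases, then fine-tune only those.

```python
import torch.nn as nn

# Stand-in model; in BitFit this would be a pretrained transformer.
model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))

# Freeze everything except bias terms.
for name, p in model.named_parameters():
    p.requires_grad = name.endswith("bias")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(trainable, total)  # 320 trainable bias params out of 33088 total
```

Only a tiny fraction of parameters remains trainable, yet BitFit reports competitive results, which is the evidence that biases are not merely dead weight.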

  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jun 29 '23 at 13:51