In the original Transformer paper, the attention mechanism uses parameter matrices but no bias terms. However, in more recent implementations, people often use a bias term when computing the "key", "query", and "value" projections. For example, in Andrej Karpathy's recent GPT implementation (nanoGPT), whether a bias term is used is controlled by the config:
```python
bias: bool = True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
```
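For concreteness, here is a minimal NumPy sketch (my own illustration, not Karpathy's code; `qkv_projection` and all names are hypothetical) of what the flag toggles, i.e. whether the Q/K/V linear projections carry an additive bias:

```python
import numpy as np

def qkv_projection(x, W_q, W_k, W_v, b_q=None, b_k=None, b_v=None):
    """Project x (seq_len, d_model) into queries, keys, values.

    The bias vectors are optional: the original Transformer omits
    them, while GPT-2-style models include them.
    """
    q = x @ W_q + (0.0 if b_q is None else b_q)
    k = x @ W_k + (0.0 if b_k is None else b_k)
    v = x @ W_v + (0.0 if b_v is None else b_v)
    return q, k, v

rng = np.random.default_rng(0)
d_model = 8
x = rng.standard_normal((4, d_model))          # toy sequence of 4 tokens
W_q, W_k, W_v = rng.standard_normal((3, d_model, d_model))

# bias=False: no bias, as in the original Transformer
q0, k0, v0 = qkv_projection(x, W_q, W_k, W_v)

# bias=True: learned bias vectors (zero-initialized here), as in GPT-2
b = np.zeros(d_model)
q1, k1, v1 = qkv_projection(x, W_q, W_k, W_v, b_q=b, b_k=b, b_v=b)
```

With zero-initialized biases the two variants start out identical; the question is whether letting the biases train away from zero buys anything.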
This makes me wonder whether there is any evidence that the bias terms help. In particular, if, according to Karpathy, not using bias is "a bit better and faster", why does he use them by default?