
I am reading the BERT paper, *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*.

As I look at the attention mechanism, I don't understand why the BERT encoder has an intermediate layer between the attention and output layers with a larger output size ($4H$, where $H$ is the hidden size). Perhaps it is related to the layer normalization but, looking at the code, I'm not certain.
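
For context, here is a minimal sketch of the block I am asking about (PyTorch; the class name `BertFeedForward` and its structure are illustrative, not the actual BERT source code):

```python
import torch
import torch.nn as nn

class BertFeedForward(nn.Module):
    """Position-wise feed-forward sub-layer as used in BERT-style encoders.

    Illustrative sketch: the "intermediate" layer projects from the hidden
    size H up to 4*H, applies GELU, and the output layer projects back to H.
    """

    def __init__(self, hidden_size: int = 768, intermediate_size: int = 3072):
        super().__init__()
        # Intermediate (expansion) projection: H -> 4*H
        self.intermediate = nn.Linear(hidden_size, intermediate_size)
        self.activation = nn.GELU()
        # Output (contraction) projection: 4*H -> H
        self.output = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size), typically the attention output
        return self.output(self.activation(self.intermediate(x)))


# For BERT-base, H = 768 and the intermediate size is 4 * 768 = 3072.
ffn = BertFeedForward(hidden_size=768, intermediate_size=3072)
x = torch.randn(2, 16, 768)   # (batch, seq_len, H)
print(ffn(x).shape)           # torch.Size([2, 16, 768])
```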


1 Answer


The paper *Undivided Attention: Are Intermediate Layers Necessary for BERT?* should answer this.

In the abstract, they write:

> All BERT-based architectures have a self-attention block followed by a block of intermediate layers as the basic building component. However, a strong justification for the inclusion of these intermediate layers remains missing in the literature.

In the conclusion, they write:

> In this work we proposed a modification to the BERT architecture focusing on reducing the number of intermediate layers in the network. With the modified BERT$_\text{BASE}$ network we show that the network complexity can be significantly decreased while preserving accuracy on fine-tuning tasks.

  • So, why are "these intermediate layers" used? You're quoting the abstract that says that an explanation in the literature is missing, but do the authors of this paper provide it? – nbro Dec 15 '21 at 09:26