5

In PyTorch, transformer (BERT) models have an intermediate dense layer between the attention and output layers, whereas the BERT and Transformer papers only describe the attention output being fed directly into the encoder's fully connected output layer, right after adding the residual connection.

Why is there an intermediate layer within an encoder block?

For example,

encoder.layer.11.attention.self.query.weight
encoder.layer.11.attention.self.query.bias
encoder.layer.11.attention.self.key.weight
encoder.layer.11.attention.self.key.bias
encoder.layer.11.attention.self.value.weight
encoder.layer.11.attention.self.value.bias
encoder.layer.11.attention.output.dense.weight
encoder.layer.11.attention.output.dense.bias
encoder.layer.11.attention.output.LayerNorm.weight
encoder.layer.11.attention.output.LayerNorm.bias
encoder.layer.11.intermediate.dense.weight
encoder.layer.11.intermediate.dense.bias

encoder.layer.11.output.dense.weight
encoder.layer.11.output.dense.bias
encoder.layer.11.output.LayerNorm.weight
encoder.layer.11.output.LayerNorm.bias

I am confused by this third (intermediate) dense layer that sits between the attention output dense layer and the encoder output dense layer.
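For reference, the parameter names above can be reproduced with a short sketch like the following (assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint):

```python
# A minimal sketch for listing the parameter names above; assumes the
# Hugging Face `transformers` library and the "bert-base-uncased" checkpoint.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Print the parameter names and shapes of the last encoder block (layer 11).
for name, param in model.named_parameters():
    if name.startswith("encoder.layer.11."):
        print(name, tuple(param.shape))
```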

  • Welcome to SE:AI! I did a slight edit to draw attention to the question posed. – DukeZhou Oct 26 '21 at 00:22
  • Hello. Can you please provide the link to the PyTorch model that has an intermediate layer that you're referring to? – nbro Oct 26 '21 at 11:41
  • Hello, thank you for your response. In fact, each pre-trained model on Hugging Face has the same architecture. You can look, for example, at 'bert-base-uncased' at https://huggingface.co/bert-base-uncased (three fully connected layers per encoder block: one in attention, one intermediate, and one output). – mohammad ali Humayun Oct 26 '21 at 12:15
  • I am confused by this third (intermediate) dense layer between the attention output and encoder output dense layers. – mohammad ali Humayun Oct 26 '21 at 12:25

2 Answers

5

The feedforward layer is an important part of the transformer architecture.

The transformer architecture, in addition to the self-attention layer that aggregates information from the whole sequence and mixes the tokens according to attention scores computed from the queries and keys, has a feedforward layer, which is usually a 2-layer MLP that processes each token separately: $$ y = W_2 \sigma(W_1 x + b_1) + b_2 $$

where $W_1, W_2$ are the weight matrices, $b_1, b_2$ are the biases, and $\sigma$ is a nonlinearity (ReLU, GELU, etc.).


This is essentially a position-wise nonlinear transformation of the sequence, applied to each token independently.

I suspect that intermediate here corresponds to $W_1, b_1$, and output corresponds to $W_2, b_2$.
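A minimal PyTorch sketch of that feedforward block, assuming the bert-base sizes (hidden size 768, intermediate size 3072); the class and attribute names here are illustrative, chosen to mirror the intermediate.dense / output.dense parameters from the question, and are not the exact Hugging Face implementation:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward block of a transformer encoder layer."""

    def __init__(self, hidden_size=768, intermediate_size=3072):
        super().__init__()
        self.intermediate_dense = nn.Linear(hidden_size, intermediate_size)  # W_1, b_1
        self.output_dense = nn.Linear(intermediate_size, hidden_size)        # W_2, b_2
        self.act = nn.GELU()                        # BERT uses GELU
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # x: (batch, seq_len, hidden_size); applied to each position independently.
        h = self.act(self.intermediate_dense(x))    # sigma(W_1 x + b_1)
        y = self.output_dense(h)                    # W_2 h + b_2
        return self.layer_norm(y + x)               # residual connection + LayerNorm
```

The LayerNorm over the residual connection corresponds to output.LayerNorm in the parameter list from the question (the actual Hugging Face implementation also applies dropout before the residual addition).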

  • Hi, thank you for your response. I understand that the attention sublayer in the encoder has a fully connected dense layer followed by another fully connected output dense layer, but all pre-trained transformers on Hugging Face have three fully connected layers in each encoder block: one in attention, one intermediate, and one output. Just for confirmation, are you suggesting that the MLP within attention (before the skip connection) has two layers, one of which is being referred to as the intermediate layer? Thanks again – mohammad ali Humayun Oct 26 '21 at 12:18
  • The language "take a moment to think and process this information" is a seriously unhelpful and obfuscating way to describe a matrix multiplication. I would discourage people from unnecessary anthropomorphism of simple operations just because they are in the context of AI. – Jon Deaton Feb 16 '22 at 19:32
  • @mohammadaliHumayun The multihead-attention block has four linear layers - $W_Q$, $W_K$, $W_V$ and $W_O$. The layer $W_O$ is used to combine the concatenated output from the multiple heads. It is called `attention.output.dense` in the model that you are using. After the attention block you have a multi-layer perceptron (MLP). As explained in the answer, the MLP usually has two layers $W_1$ and $W_2$. So $W_1$ is called `intermediate.dense` and $W_2$ is called `output.dense`. – pi-tau Aug 02 '23 at 09:08
0

In the documentation of BERT on Hugging Face, you will find that:

intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.

So the intermediate layer is the feed-forward layer.
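For example, a quick way to check this (assuming the Hugging Face `transformers` library is installed):

```python
# A quick check of the relevant config values for "bert-base-uncased".
from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.hidden_size)        # 768  -- model (attention) width
print(config.intermediate_size)  # 3072 -- width of the feed-forward ("intermediate") layer
```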

Ya Wen