An encoder has three inputs, Q, K, V, but only a single output, i.e. 3 vs. 1.
How do you stack those?
Is there a more detailed diagram?
One encoder block of the transformer takes a single tensor $X$ as input and multiplies it by the learned weight matrices $W_Q$, $W_K$, $W_V$ to compute the $Q$, $K$, $V$ needed for self-attention.
After the self-attention and feed-forward sub-layers, the block returns a single tensor $X'$, which serves as the input to the next encoder block, so stacking is straightforward; see the sketch below.
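As a rough illustration, here is a minimal PyTorch sketch of one encoder block and how blocks stack. The class name `EncoderBlock` and the hyperparameter values are my own simplifications for readability, and `nn.MultiheadAttention` handles the $W_Q$, $W_K$, $W_V$ projections internally; this is not the exact implementation from the paper:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One transformer encoder block: self-attention + feed-forward.

    Simplified sketch; d_model, n_heads, d_ff are illustrative values.
    """
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # W_Q, W_K, W_V live inside MultiheadAttention; it projects
        # the same input X into Q, K, V when we pass X three times.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention: Q, K, V are all derived from the same x.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)      # residual + layer norm
        x = self.norm2(x + self.ff(x))    # residual + layer norm
        return x                          # single output X'

# Stacking: the single output of one block is the single input of the next.
encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])
x = torch.randn(2, 10, 512)    # (batch, sequence length, d_model)
print(encoder(x).shape)        # torch.Size([2, 10, 512])
```

Because every block maps one tensor to one tensor of the same shape, there is no "3 vs. 1" mismatch between blocks; the three-way split into $Q$, $K$, $V$ happens internally.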
I find this specific post really helpful:
https://jalammar.github.io/illustrated-transformer/
It includes a diagram showing that from a single input $X$ we compute $Q$, $K$, $V$. Learning the weight matrices $W_Q$, $W_K$, $W_V$ is part of training the encoder.
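In plain NumPy terms, the projections that diagram depicts are just three matrix multiplications against the same $X$; the dimensions below are illustrative assumptions, not values from the post:

```python
import numpy as np

d_model, d_k = 512, 64               # illustrative sizes
X = np.random.randn(10, d_model)     # one input: 10 tokens

# Three learned weight matrices, all applied to the same single X.
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # each has shape (10, d_k)
```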