
I have been trying to implement a Transformer architecture in PyTorch by following the Attention Is All You Need paper, as well as The Annotated Transformer blog post to compare my code with theirs. I noticed that in their implementation of Multi-Head Attention they use three nn.Linear(d_model, d_model) layers to project the encoder's input before splitting these projections into (n_heads, d_k) matrices for the attention. But as I understand the paper, we need n_heads separate nn.Linear(d_model, d_k) layers for each of the queries, keys and values, as we can see in the Multi-Head Attention diagram from the paper:

[Figure: the Multi-Head Attention diagram from the paper, showing $h$ parallel Linear layers for $V$, $K$ and $Q$ feeding Scaled Dot-Product Attention, followed by Concat and a final Linear layer]

We can clearly see as many nn.Linear layers as there are heads, and the same follows from the authors' explanation:

[Equations from the paper: $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \ldots, head_h)W^{O}$, where $head_i = \mathrm{Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V})$]

Each $head_{i}$ uses $W_{i}^{Q}$, $W_{i}^{K}$ and $W_{i}^{V}$. So in my implementation I did this:

import torch
import torch.nn as nn


class MultiHeadedAttention(nn.Module):
  def __init__(self, d_model=512, h=8):
    super(MultiHeadedAttention, self).__init__()
    self.d_model = d_model
    self.h = h
    self.d_k = d_model // h
    # One (d_model -> d_k) projection per head, for queries, keys and values.
    self.query_linears = nn.ModuleList([nn.Linear(d_model, self.d_k) for _ in range(h)])
    self.key_linears = nn.ModuleList([nn.Linear(d_model, self.d_k) for _ in range(h)])
    self.value_linears = nn.ModuleList([nn.Linear(d_model, self.d_k) for _ in range(h)])
    self.projection_layer = nn.Linear(h * self.d_k, d_model)

  def forward(self, Q, K, V, mask=None):
    batch_size = Q.size(0)
    # Apply each head's projection and stack along the head dimension:
    # (batch_size, h, seq_len, d_k)
    queries = torch.cat([linear(Q).view(batch_size, 1, -1, self.d_k) for linear in self.query_linears], dim=1)
    keys = torch.cat([linear(K).view(batch_size, 1, -1, self.d_k) for linear in self.key_linears], dim=1)
    values = torch.cat([linear(V).view(batch_size, 1, -1, self.d_k) for linear in self.value_linears], dim=1)

    # scaled_dot_product_attention is my own implementation of the attention function.
    x = scaled_dot_product_attention(queries, keys, values, mask)

    # (batch_size, h, seq_len, d_k) -> (batch_size, seq_len, h * d_k) -> (batch_size, seq_len, d_model)
    x = x.transpose(1, 2)
    x = x.contiguous()
    x = x.view(batch_size, -1, self.h * self.d_k)
    x = self.projection_layer(x)
    return x

But I'm surely missing a key piece of understanding, and I'd be really grateful if someone could point it out to me.

Thank you.

Daviiid

1 Answer


It is just an optimization technique.

If you have a vector $x$ of size $d$ and you want to multiply it with $n$ different matrices $W_i$, each of shape $d \times d_k$, then you can simply stack these matrices along the last dimension and perform a single matrix multiplication.

A block view of this matrix operation would look like this:

\begin{equation} x \underbrace{ \begin{bmatrix} W_0 & W_1 & \cdots & W_{n-1} \end{bmatrix} }_{\text{stack along the last dim}} = \begin{bmatrix} xW_0 & xW_1 & \cdots & xW_{n-1} \end{bmatrix} \end{equation}

Now instead of looping over all the matrices, you actually perform the forward pass with a single vector-matrix multiplication $xW$, where $W$ has shape $d \times nd_k$.
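
Here is a minimal PyTorch sketch of that equivalence (the names `per_head` and `fused` are just for illustration, not taken from your code): copying the per-head weights into one stacked layer reproduces the concatenation of the per-head outputs, which is why a single nn.Linear(d_model, h * d_k) can replace $h$ separate nn.Linear(d_model, d_k) layers.

import torch
import torch.nn as nn

d_model, h = 512, 8
d_k = d_model // h
per_head = [nn.Linear(d_model, d_k) for _ in range(h)]  # h separate projections, as in your code
fused = nn.Linear(d_model, h * d_k)                     # one stacked projection (here h * d_k == d_model)

# Stack the per-head weights/biases along the output dimension so both compute the same map.
with torch.no_grad():
    fused.weight.copy_(torch.cat([lin.weight for lin in per_head], dim=0))
    fused.bias.copy_(torch.cat([lin.bias for lin in per_head], dim=0))

x = torch.randn(2, 10, d_model)                              # (batch, seq_len, d_model)
out_per_head = torch.cat([lin(x) for lin in per_head], dim=-1)
out_fused = fused(x)
print(torch.allclose(out_per_head, out_fused, atol=1e-6))    # True

# Splitting the fused output into heads recovers each head's projection.
heads = out_fused.view(2, 10, h, d_k).transpose(1, 2)        # (batch, h, seq_len, d_k)
print(torch.allclose(heads[:, 0], per_head[0](x), atol=1e-6))  # True

Gradients flow through the fused layer exactly as they would through the separate layers, so the two approaches are equivalent during training as well.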

Another thing to note is that the authors of the paper chose $n$ and $d_k$ such that $d = nd_k$. So if $d=512$ and you want to have 8 heads in the multi-head attention layer, then you set $d_k=64$. This was done so that no matter how many heads you choose, you always have the same number of parameters in the model. I guess it makes hyperparameter search easier this way, but you don't have to do it if you don't want to.
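
As a quick sanity check of that parameter count (illustrative numbers only, ignoring biases): with $d_k = d/n$, the query/key/value projections always hold $n \cdot d \cdot d_k = d^2$ weights, no matter how many heads you pick.

d = 512
for n in (1, 2, 4, 8, 16):
    d_k = d // n
    print(n, d_k, n * d * d_k)  # the weight count is always 512 * 512 = 262144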

If you want a more detailed blog post about the implementation details of the transformer model, feel free to check out this one: https://pi-tau.github.io/posts/transformer/

pi-tau
  • Thank you! So my implementation is correct too, right? Can I ask you, since you have written a blog post about it, whether you had difficulties training your transformer? I have implemented one myself but it doesn't learn, so I was trying to check each component individually. I thought that my MHA didn't get the gradients flowing well, but by the same logic the gradients should flow backwards in the same way whether there are $h$ separate layers or not. – Daviiid Jul 25 '23 at 18:27
  • I see that you are using the torch function `scaled_dot_product_attention`. I had some problems making this work. For some reason it just outputs `nan`. That is why I wrote this one myself. Maybe check if everything is ok with the outputs from this func. You can also see the implementation in the blog post, or even better, try to implement it yourself. – pi-tau Jul 25 '23 at 18:34
  • Yeah thank you! I have implemented the scaled dot product myself too, sorry I was being imprecise, and my masking too, and I have unit tested them so I'm sure they work. Right now I'm going through each one of my components to "debug" them. I had a look at your blog, it's beautiful to the eye, a refined old-computer style, right? The code and the visualizations are clean too. I have to be honest with you, I added it to my bookmarks but I won't read it right now so I don't get lost between many implementations. (By the way, if you can check out my code I'll make it worth your time ^^') – Daviiid Jul 25 '23 at 18:45
  • Just add a GitHub link (or something like that) and I'll check it out – pi-tau Jul 25 '23 at 18:51
  • Oh thanks a lot! I have put all the code in this notebook: https://colab.research.google.com/drive/1ZDDfQrjcOl9rGkYmPXumBWX97Lw_hqUr?usp=sharing I put random examples in the beginning so you can use them if you want to check that the code works. Thanks a lot, I'm really grateful for this help. – Daviiid Jul 25 '23 at 19:32
  • Code looks just fine. I can't see anything wrong with the model. This part where you copy the weights so that `target_embedder` and `target_projection` share the same weights looks a bit hacky and you should probably test whether it propagates the gradients correctly – pi-tau Jul 25 '23 at 20:08
  • Thanks a lot. I'll definitely check it out. Thank you again. – Daviiid Jul 25 '23 at 20:13
  • You should try it without weight sharing and see how that goes. Weight init is very important. Your `target_projector` should use `xavier` init, and right now it uses standard gaussian init. I think that scaling the embed layer with `sqrt(d_model)` is to counter-act exactly that. Also in the `lr_func` you should raise the warm-up steps to the power `-1.5` and not `-0.5` (see the sketch after these comments for the schedule from the paper). Note that this function has to be used with `learning_rate=1.0` because it already outputs the correct learning rate. Add your training loop, maybe I will find something there. Also allow me to edit so I can add comments. – pi-tau Jul 25 '23 at 20:24
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/147467/discussion-between-daviiid-and-pi-tau). – Daviiid Jul 25 '23 at 20:49
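
For reference, a minimal sketch of the learning-rate schedule from the Attention Is All You Need paper that the last comments refer to (the names `lr_func`, `step` and `warmup_steps` are illustrative, not taken from the linked notebook):

def lr_func(step, d_model=512, warmup_steps=4000):
    # Schedule from the paper: lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Meant to be used with a base learning rate of 1.0 (e.g. as the lr_lambda of
# torch.optim.lr_scheduler.LambdaLR), since it already returns the full learning rate.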