5

I've been looking into self-attention lately, and the articles I've been reading all talk about "weights" in attention. My understanding is that the weights in self-attention are not the same as the weights in a neural network.

From this article, http://peterbloem.nl/blog/transformers, in the additional tricks section, it mentions,

The query is the product of the query weight matrix and the word vector, i.e. $q = W_q x$; the key is the product of the key weight matrix and the word vector, $k = W_k x$; and similarly for the value, $v = W_v x$. So my question is: where do these weight matrices come from?
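
To make the setup concrete, here is a rough sketch of my understanding (the shapes and names below are made up just for illustration):

```python
import numpy as np

d_model = 4                       # made-up embedding size, just for illustration
x = np.random.randn(d_model)      # one word vector

# the three weight matrices I'm asking about -- where do these come from?
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

q = W_q @ x   # query
k = W_k @ x   # key
v = W_v @ x   # value
```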

Mark

2 Answers

3

The answer is actually really simple: they are all randomly initialised. So they are to all intents and purposes "normal" weights of a neural network.

This is also the reason why, in the original paper, the authors tested several settings with single and multiple attention heads. If these matrices were somehow "special" or predetermined, they would all serve the same purpose. Instead, because of their random initialisation, each attention head learns to contribute to solving a different task, as they show in Figures 3 and 4.
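
As a minimal sketch (PyTorch here, not the original implementation; names and shapes are just illustrative): the projection matrices are nothing more than ordinary, randomly initialised linear layers inside the attention module, trained by backpropagation together with the rest of the network.

```python
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    """Minimal single-head self-attention: W_q, W_k, W_v are ordinary linear layers."""
    def __init__(self, d_model):
        super().__init__()
        # Randomly initialised, exactly like any other nn.Linear weights,
        # and updated by backpropagation like any other weights.
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** 0.5

    def forward(self, x):                        # x: (seq_len, d_model)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.T / self.scale            # raw attention scores
        weights = torch.softmax(scores, dim=-1)  # attention weights
        return weights @ v
```

A multi-head layer simply keeps several such sets of matrices side by side, each starting from a different random initialisation.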

Edoardo Guerriero
  • So how are those weight matrices updated? Is there a separate neural network for that? – Mark Sep 01 '20 at 02:16
  • Why do you need another neural network? You just backpropagate everything all the way back to these very first matrices. You can see from the TensorFlow tensor2tensor code that the matrices are just dense layers (or conv1D layers): https://github.com/tensorflow/tensor2tensor/blob/d9f807cf2738323d19aba0a20a8cf0c7f7da8b27/tensor2tensor/layers/common_attention.py#L2193 (see the sketch just after these comments). – Edoardo Guerriero Sep 01 '20 at 10:55
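
A small sketch of the point made in that last comment, with plain randomly initialised tensors standing in for the dense layers: gradients from the loss reach the projection matrices directly, so no separate network is needed to update them.

```python
import torch

torch.manual_seed(0)
d_model, seq_len = 8, 5           # illustrative sizes

# The projection matrices are plain parameters, randomly initialised.
W_q = torch.randn(d_model, d_model, requires_grad=True)
W_k = torch.randn(d_model, d_model, requires_grad=True)
W_v = torch.randn(d_model, d_model, requires_grad=True)

x = torch.randn(seq_len, d_model)
q, k, v = x @ W_q, x @ W_k, x @ W_v
weights = torch.softmax(q @ k.T / d_model ** 0.5, dim=-1)
out = weights @ v

# A dummy loss: backpropagation reaches the projection matrices directly.
loss = out.sum()
loss.backward()
print(W_q.grad.shape)             # torch.Size([8, 8]) -- gradient for W_q
```
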
0

In my mind there are two weight matrices: the one you get prior to applying the softmax,

$$ \alpha_{i,j} = \frac{\langle q_i, k_j \rangle}{\sqrt{d}}$$

the other you get after applying the softmax:

$$ \text{Attn}_{i,j}(X) = \frac{\exp \left( \frac{\langle q_i, k_j \rangle}{\sqrt{d}}\right)}{\sum_{l} \exp\left( \frac{\langle q_i, k_l \rangle}{\sqrt{d}} \right)}.$$
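
As a concrete sketch with made-up numbers, the two matrices look like this:

```python
import numpy as np

np.random.seed(0)
n, d = 4, 8                       # 4 tokens of dimension 8, numbers made up
Q = np.random.randn(n, d)         # queries q_i as rows
K = np.random.randn(n, d)         # keys k_j as rows

# pre-softmax weight matrix: alpha[i, j] = <q_i, k_j> / sqrt(d)
alpha = Q @ K.T / np.sqrt(d)

# post-softmax weight matrix: each row is normalised to sum to 1
# (subtracting the row max first is just for numerical stability)
e = np.exp(alpha - alpha.max(axis=1, keepdims=True))
attn = e / e.sum(axis=1, keepdims=True)
print(attn.sum(axis=1))           # -> [1. 1. 1. 1.]
```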

Either way, you can view them as directed "attention graphs" and they can be interpreted in terms of graph neural networks using only complete graphs and graph attention:

$$ A_{i,j} = \frac{\exp(w(i,j))}{\sum_{k \in \mathcal{N}(i)} \exp(w(i,k))},$$

where $w(i,j)$ is a weight function, which can be, for example, the scaled dot product used in attention. The sum is over the neighborhood $\mathcal{N}(i)$ of a node $i$ (think: token). Hope that helps a little with intuition and that it's close to what you meant.
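
A rough sketch of that graph view, with the scaled dot product as the weight function $w$ and an all-ones adjacency standing in for the complete graph (so each neighborhood $\mathcal{N}(i)$ is the whole sequence), in which case it reduces to the usual softmax attention matrix:

```python
import numpy as np

np.random.seed(1)
n, d = 4, 8                       # illustrative sizes
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
w = Q @ K.T / np.sqrt(d)          # w(i, j): scaled dot-product weight function

# adjacency of the attention graph; all ones = complete graph with self-loops,
# i.e. every token's neighborhood N(i) is the whole sequence
adj = np.ones((n, n))

# normalise w(i, j) over each node's neighborhood, zero weight elsewhere
e = np.where(adj > 0, np.exp(w - w.max(axis=1, keepdims=True)), 0.0)
A = e / e.sum(axis=1, keepdims=True)
print(A.sum(axis=1))              # each row sums to 1
```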