
I have been looking for the answer in other questions, but none of them tackled this. I want to ask how the padding mask is taken into account in the attention formula.

The attention formula taking into account a causal mask is: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T} + \text{CausalMask}}{\sqrt{d_{k}}}\right)V$

But how do we add the padding mask? The aim of a padding mask is to mask out the padding positions, since they exist only to make batching feasible. But I don't see how this mask is added in the attention formula.

Does it make sense to do an element-wise multiplication of the attention output with a tensor of ones of shape (batch size, sequence length, $d_{model}$), where for every sentence $s$ in the batch and every position $p$ that holds a padding token, tensor[s, p, :] is all zeros?

Thank you in advance for your help!

Daviiid

2 Answers


Entries of an attention mask are typically either $0$ or $-\infty$.

So, adding such a mask gives either the original entry of $QK^T$ or $-\infty$.

The issue with entrywise multiplication by a binary matrix is that $0$ values still contribute to the softmax.

$$softmax(z)_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$$

$e^0$ is $1$, so an element-wise product before the softmax is not really a mask. Multiplying after the softmax doesn't work either, since the output would no longer be a probability distribution. To ensure the masked elements do not contribute at all, you need them to be $-\infty$, which is exactly what adding the mask does.
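To make this concrete, here is a tiny numerical sketch (PyTorch is just one possible choice; the scores are made up for illustration):

```python
import torch

scores = torch.tensor([2.0, 1.0, 0.5])  # raw attention scores for one query
keep = torch.tensor([1.0, 1.0, 0.0])    # last position is padding

# Multiplicative "mask": the zeroed score still contributes e^0 = 1 to the softmax.
print(torch.softmax(scores * keep, dim=-1))      # last entry is NOT zero

# Additive mask: -inf at the padded position gives it exactly zero weight.
additive = torch.zeros_like(scores).masked_fill(keep == 0, float("-inf"))
print(torch.softmax(scores + additive, dim=-1))  # last entry is exactly 0
```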

If you know that certain indices of the input are padded, that is, they are all-zero vectors, then the padding tokens are already ignored by the matmul (since they are zero). The issue is that the rows of $QK^T$ corresponding to the pad tokens are zero and not $-\infty$.

To ignore pad tokens, you create a mask with $-\infty$ along the columns corresponding to the padding positions and add it to $QK^T$ before the softmax. So it's the same attention formula, just with a different value of the additive mask.
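As a minimal sketch of what that looks like in code (assuming a PyTorch setting; the boolean `pad_mask` marking padded key positions is a hypothetical input, not something from the question):

```python
import torch
import torch.nn.functional as F

def attention_with_padding(Q, K, V, pad_mask):
    """Scaled dot-product attention with a column-wise padding mask.

    Q, K, V  : (batch, seq_len, d_k)
    pad_mask : (batch, seq_len) bool tensor, True where the token is padding.
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_len, seq_len)

    # Put -inf in every column that corresponds to a pad token, so no query
    # can attend to it; (batch, 1, seq_len) broadcasts over the query dimension.
    scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))

    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V
```

Note that the rows belonging to pad queries still get a valid distribution over the non-pad keys; a full row of NaNs would only appear if every key in the sequence were padding, which is what the comments below discuss.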

Venna Banana
  • Thank you for your answer. You're totally right about the multiplication, I didn't give it much thought. But I have a question about this type of masking for the padding tokens. If we put $-\infty$ in the rows corresponding to the pad tokens, the softmax will take in rows that are all $-\infty$ and will then output a row of nan values. I reckon we can turn those nan values into $0$, but I wonder whether that is how it's done in deep learning frameworks like PyTorch etc. – Daviiid Jul 20 '23 at 19:36
  • @Daviiid In the case of TensorFlow, they mask for infinite values (floats with full exponent and empty mantissa) when implementing exp. The other frameworks probably do the same thing. – Venna Banana Jul 20 '23 at 19:55
  • Thank you for your answer. Can I ask whether that is mathematically equivalent to replacing the rows where we would have nan values, in a normal implementation, with $0$? I'm trying to understand the effect of having a row full of $-\infty$ (the rows corresponding to padding tokens) on the multiplication by the values $V$ after the softmax, since such rows would, in a "normal" setting, lead to an undefined softmax output because of $0/0$ – Daviiid Jul 20 '23 at 20:04
  • Well, you will only see a $0$ in the denominator if all entries are $-\infty$, not just some rows; that is, your entire sequence would have to be padding. In this case, the framework will return some NaN symbol, and the network evaluation will not work. I think the formula I used for softmax in the answer could be clearer. Edited. – Venna Banana Jul 20 '23 at 21:10
  • Yeah you're right. Well, we do apply the softmax per row right? And we put $-\infty$ in the rows corresponding to pad token positions right? So we get a whole row of $-\infty$ that's fed to a softmax per row. Maybe this last part is wrong. If $Q$ is of shape (bsz, seq len, d_m) and the same for $K$, we get $QK^{T}$ of shape (bsz, seq len, seq len), will the padding mask, of shape (seq len, seq len) have $-\infty$ column-wise? E.g. if all sentences have a pad token at the end then will the last column of the padding mask be $-\infty$, or, will the last row be all $-\infty$? – Daviiid Jul 20 '23 at 21:27
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/147362/discussion-between-venna-banana-and-daviiid). – Venna Banana Jul 20 '23 at 21:36

For a given sequence $x$ you calculate the attention scores using the formula:

\begin{equation} A = \frac{x Q K^{T} x^{T}}{\sqrt{d_k}}, \end{equation} where $Q, K$ are the query and key matrices of the attention layer.

The result is a square matrix of size $T \times T$, where $T$ is the length of the sequence $x$. The entry $A_{i,j}$ gives the attention score between $x_{i}$ and $x_{j}$ (note that $A_{i,j} \neq A_{j,i}$). So row $i$ gives you the attention scores for token $x_i$, i.e. how much it should attend to each of the other tokens of the sequence. However, you actually want to use these scores to perform a weighted average over the value encodings given by $xV$ (where $V$ is the value matrix of the attention layer). The scores can be arbitrary real numbers, but you want positive weights that sum to $1$. That is why we apply a softmax to convert the scores into attention weights.

Now, if your sequence contains pad tokens, you don't want $x_i$ to attend to them, so you want to "remove" the attention between $x_i$ and the pad tokens. You could set the attention weights directly to $0$, but then they would no longer sum to $1$. Instead, you mask the attention scores before applying the softmax: set the scores between $x_i$ and the pad tokens to a large negative number, and the softmax will then produce attention weights of $0$ for those positions. You can set the scores to -float("inf"), but I think setting them to $-1e9$ is more than enough.
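A rough sketch of this for a single, unbatched sequence (the names `W_q`, `W_k`, `W_v` and the boolean `is_pad` vector are hypothetical, chosen just for illustration):

```python
import torch

def masked_self_attention(x, W_q, W_k, W_v, is_pad):
    """Self-attention for one sequence, masking out pad positions.

    x            : (T, d_model) token encodings
    W_q, W_k, W_v: (d_model, d_k) projection matrices
    is_pad       : (T,) bool tensor, True at pad positions
    """
    d_k = W_k.size(-1)
    A = (x @ W_q) @ (x @ W_k).T / d_k ** 0.5   # (T, T) attention scores

    # Large negative score towards every pad token; after the softmax their
    # attention weights are (numerically) zero.
    A = A.masked_fill(is_pad.unsqueeze(0), -1e9)

    weights = torch.softmax(A, dim=-1)         # each row sums to 1
    return weights @ (x @ W_v)
```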

For a concrete example see this github repo. Also, here is an extensive blog post that I wrote about the Transformer; you might like it.

pi-tau