My understanding is that masked self-attention is necessary during training of GPT-2, as otherwise the model would be able to directly see the correct next token at each position. My question is whether the attention mask is necessary, or even possible, during inference. As GPT-2 will only be producing one token at a time, it doesn't seem to make sense to mask out future tokens that haven't been generated yet.
1 Answer
Answer to Q1) If sampling the next token, do you need to apply the mask during inference?
Yes you do! The model's ability to transfer information across positions was trained in this manner, and changing it will have unpredictable consequences. Let me try to give an example:
Tokens: 1:sally, 2:sold, 3:seashells, 4:on, 5:the, 6:____
In the above you are trying to predict token 6 from tokens {1:5}.
Denote by $n^{(m)}$ the set of tokens that the embedding at position $n$ has drawn information from at the $m^{th}$ layer.
In both cases we see that $n^{(0)} = \{n\} \ \ \forall n$. With a mask, though, we get $n^{(i)} = \{k : k \leq n\} \ \ \forall n$ for every layer $i \geq 1$, but without it we get $n^{(i)} = \{k : 1 \leq k \leq N\} \ \ \forall n$. This difference means the embeddings entering the final layer will differ completely, and unless we train for such an approach it will cause errors.
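Here is a minimal sketch of that difference, assuming PyTorch and a toy single-head attention where the queries, keys, and values are just the input embeddings themselves; the names and sizes (`n_tokens`, `d_model`) are illustrative, not GPT-2's actual code:

```python
import torch
import torch.nn.functional as F

n_tokens, d_model = 5, 768            # "sally sold seashells on the"
x = torch.randn(n_tokens, d_model)    # stand-in for one layer's input embeddings

# toy single-head attention: queries, keys, values are all just x here
scores = x @ x.T / d_model ** 0.5     # (n_tokens, n_tokens) attention scores

# causal mask: position n may only attend to positions k <= n
mask = torch.tril(torch.ones(n_tokens, n_tokens)).bool()
masked_scores = scores.masked_fill(~mask, float("-inf"))

with_mask = F.softmax(masked_scores, dim=-1) @ x     # n^(1) = {k : k <= n}
without_mask = F.softmax(scores, dim=-1) @ x         # n^(1) = {1..N} for every n
```

The two outputs agree only at the last position (where $k \leq n$ already covers everything), so dropping the mask at inference feeds the later layers embeddings unlike anything they saw during training.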
Answer to Q2) What is the sample dimension?
It took me a couple of reads to understand what you're asking, but I think I've got it. The sample at each step is drawn from a distribution whose logits are a linear function of a single embedding of dimension $d_{\text{model}}$; therefore that is our upper bound: $\dim(\text{sample}) \leq d_{\text{model}}$, which in the example you gave is 768.
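As a minimal sketch of where the sampled token comes from (assuming PyTorch; `W_out` and the sizes here are illustrative stand-ins, not GPT-2's actual weights): the final-layer embedding of the last position, of dimension $d_{\text{model}} = 768$, is projected to vocabulary logits, and a single token id is drawn from the resulting distribution.

```python
import torch

d_model, vocab_size = 768, 50257
h_last = torch.randn(d_model)              # final hidden state of the last position
W_out = torch.randn(vocab_size, d_model)   # stand-in for the output projection

logits = W_out @ h_last                    # (vocab_size,) logits, a linear map of one embedding
probs = torch.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)   # one sampled token id
```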

Hi, about your example: I think you are describing how the mask works but not why it's needed, right? As said in the question, in inference mode there are no future tokens, so there are no tokens for $k > n$, and $k \in [1:N]$ is the same as $k \leq n$. It seems there is no need for the mask in inference. Is there anything wrong with that? – dingx Mar 16 '23 at 09:35