
From what I understand, Transformer Encoders and Decoders use a fixed number of tokens as input, e.g., 512 tokens. In NLP, for instance, different sentences have different numbers of tokens, and the way to deal with that is to truncate the longer sentences and pad the shorter ones. As an additional input, a padding mask must be supplied to the Transformer so that its attention is focused only on the relevant tokens.

My question is: is there something in the architecture that forces the transformer to have a fixed number of tokens as input (rather than adapting dynamically to the actual input length, as RNNs do, for instance)?

For comparison, I think of fully-convolutional networks or RNNs with variable input lengths. They are agnostic to the actual input dimension because they apply the same pointwise operations to each patch or time step. When applying an RNN to an n-token sentence you compute the same block n times, and on a k-token sentence you apply it k times, so the architecture does not require padding or truncating (at least not in theory; I am not referring to implementation considerations here). In transformers, embedding the tokens, computing attention, and the feed-forward layers can all be performed on sequences of different lengths, since the weights are applied per token, right? So why do we still truncate and pad to a fixed size? Or perhaps it is feasible, but not done in practice for other reasons?
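To make this concrete, here is a minimal NumPy sketch of what I mean (the dimensions and weights are made up, this is not a real model): the same projection matrices are applied per token, and the attention computation goes through for any sequence length.

```python
import numpy as np

d_emb, d_k, d_v = 64, 32, 32
rng = np.random.default_rng(0)

# The same learned projections, whose shapes do not depend on the sequence length.
W_q = rng.normal(size=(d_emb, d_k))
W_k = rng.normal(size=(d_emb, d_k))
W_v = rng.normal(size=(d_emb, d_v))

def self_attention(x):
    # x: (n, d_emb) for any number of tokens n
    q, k, v = x @ W_q, x @ W_k, x @ W_v              # (n, d_k), (n, d_k), (n, d_v)
    scores = q @ k.T / np.sqrt(d_k)                  # (n, n): grows with n, but no fixed size is baked in
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ v                               # (n, d_v)

print(self_attention(rng.normal(size=(16, d_emb))).shape)   # (16, 32)
print(self_attention(rng.normal(size=(512, d_emb))).shape)  # (512, 32)
```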

I must be missing something...

I'll ask it differently to make my question clearer: say I have an already-trained transformer model, trained on fixed-size inputs of 512 tokens (truncated and padded). At inference time, if I would like to process a single, shorter sentence, do I have to pad it or not?
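In code terms, this is roughly the situation I have in mind (a minimal PyTorch sketch with random weights standing in for the trained model, only the shapes matter): is something like this supposed to work without padding the 7 tokens up to 512?

```python
import torch
import torch.nn as nn

# Stand-in for an "already-trained" encoder (the weights here are random, just to illustrate shapes).
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

short_sentence = torch.randn(1, 7, 64)  # batch of 1, only 7 tokens, no padding to 512
print(encoder(short_sentence).shape)    # torch.Size([1, 7, 64])
```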

Thanks

A. Maman
  • It's important to remember that all machine learning and deep learning is, at its core, matrix algebra and multiplications, so it makes a lot of sense to have fixed matrix dimensions. – tired and bored dev Oct 25 '22 at 14:05
  • You mean that the reason for forcing a fixed sentence length is computational performance? – A. Maman Oct 26 '22 at 06:48

2 Answers


Edits to reflect the edits in the question: if you train your transformer on length = n, then yes, you need to pad inputs to length = n. This is not a requirement of the mathematical architecture; it's a requirement of the implementation.


There seem to be two separate ideas in your question:

  1. Why do transformers have a fixed input length?
  2. Why do transformers have input length limitations?

I am not sure which one you are asking, so I will answer both.


1) Saying transformers have a fixed input length is misleading.

Transformers accept variable-length inputs, just like RNNs. You can think of padding/truncating as an extra preprocessing step before embedding, if you want.

We don't need padding in RNNs because they process inputs (sentences) one element (token/word) at a time.

Transformers process an input all at once. If you are passing in several sentences, you have to do something to equalize their lengths, hence padding and truncating.
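As a rough sketch of that step (the token ids and the `pad` helper below are invented for illustration, not from a real tokenizer), padding turns a ragged batch into one rectangular tensor plus a mask:

```python
import torch

# Two tokenized "sentences" of different lengths (token ids are made up).
sent_a = [101, 7592, 2088, 102]              # 4 tokens
sent_b = [101, 7592, 2088, 999, 2026, 102]   # 6 tokens

max_len = max(len(sent_a), len(sent_b))
pad_id = 0

def pad(ids):
    return ids + [pad_id] * (max_len - len(ids))

input_ids = torch.tensor([pad(sent_a), pad(sent_b)])   # shape (2, 6): one rectangular tensor
attention_mask = (input_ids != pad_id).long()          # 1 = real token, 0 = padding to be ignored
print(input_ids)
print(attention_mask)
```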


2) Transformers often limit input length to 512 or 1024 because of performance issues.

If you are wondering why we don't let our transformers accept inputs that are as long as possible, the answer is that there are computational and performance limitations: the cost of self-attention grows quadratically with the sequence length.
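To see the quadratic part concretely, the attention score matrix has one entry per pair of tokens:

```python
# Back-of-the-envelope: the score matrix is n x n per head,
# so doubling the input length quadruples its size.
for n in (512, 1024, 2048):
    print(f"{n} tokens -> {n * n:,} attention scores per head")
```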

This is where the max_length parameter of a transformer comes in. If your input has 1,000 tokens and the model's max_length is 512, the transformer will throw an error, because it can only handle inputs up to 512 tokens long.
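For example, with the Hugging Face transformers library (assuming that is what you use; other frameworks enforce the limit differently), the tokenizer is typically where you handle this:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.model_max_length)     # 512 for BERT-base

very_long_text = "word " * 2000
enc = tok(very_long_text, truncation=True, max_length=512)
print(len(enc["input_ids"]))    # 512: everything past max_length was dropped
```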

waxalas
  • I'm not sure I understood what you mean. If it's the tokenization algorithm that you're talking about, that is not what I'm asking. My question is about the limit on the sequence length of the model (i.e., the number of tokens), regardless of the embedding dimension or the vocabulary size. – A. Maman Oct 26 '22 at 06:51
  • @A.Maman I just updated my answer to be more detailed. I hope it helps, but if not, you might want to provide an example in your question so we know exactly what you are asking :) – waxalas Oct 26 '22 at 15:30
  • @A.Maman I just realized maybe the answer to your question is that RNNs are recurrent, processing one word at a time along with the previous state. They do "pointwise operations" like you said. Transformers avoid this extremely expensive approach by processing whole sentences at a time, without the need for recursion. That's why you need to choose an input length from the start: because you are feeding in the complete sentence at once. – waxalas Oct 26 '22 at 16:11
  • Hi! Thanks for the answer! I understand that `max_length` is not constant in the sense of being numerically identical across all transformers, and that it is actually a parameter of the architecture. That's why you're right that it is more accurate to say it is "fixed" rather than "constant". I edited my question to try to make it clearer, but what I'm basically asking is: if we disregard training matters, can transformers process inputs of different lengths or not? Because as I understand the architecture, they can. And if so, why do I have to pad a short sentence to process it? Thanks! – A. Maman Oct 27 '22 at 10:58
  • Regarding the matrix-multiplication explanation: I understand that weight matrices have fixed dimensions. But all those matrices in transformers act pointwise on each token (the embedding matrix, the query, key and value projections, and the feed-forward layers as well), and the token embedding dimension remains the same regardless of the sequence length. So that should not be a problem for processing variable lengths. – A. Maman Oct 27 '22 at 11:05
  • Is `attention can be performed on each token independently` correct? Self-attention is a contextual matrix relating all tokens in the same input (sentence) to each other, and that is dependent on input length. I have a feeling this might be the answer we're looking for (doing lots of reconsidering at this stage, so I'm grateful for your question, haha). At the end of the day, what is the transformer training? That's the piece we're going to reuse in prediction, and it dictates the dimensions. Well, it's a set of weights relating inputs to outputs, "fixing the input length." – waxalas Oct 28 '22 at 23:20
  • Additionally, perhaps an RNN can "process an input of any length" because it processes one word at a time, so it's actually processing many inputs of length 1 sequentially (= a "fixed length" of 1)? – waxalas Oct 28 '22 at 23:24
  • Hi, you're right - this sentence was not accurate, so I've changed it in the question. I meant to claim that applying the weights in all layers (including the attention layer) is done per token. In self-attention, for instance, you compute the Q, K, V projections (which can be done on any length) and then compute the dot products between all pairs, which can also be done on different lengths. The point is that self-attention can be computed on two different lengths in two adjacent iterations. – A. Maman Oct 30 '22 at 07:41
  • OK, so your question is: can the Q, K, V projections have different dimensions for each sample in your training data? (In particular, is this possible from a theoretical standpoint rather than computationally, because we know that computationally we need to perform matmul on a constant dimension throughout the problem.) Is this what you are asking? – waxalas Oct 31 '22 at 02:19
  • (1) Hi, no. The weight matrices' ($W_q$, $W_k$, $W_v$) dimensions are $d_{emb}$ by $d_q$, $d_k$ and $d_v$ respectively, and none of those dimensions depends on the sequence length. In the attention mechanism, those projections are applied to each token, not one at a time but as the same projection on every token (you can see it as weight sharing). This happens in parallel over the tokens, but it can be done on an n-long sequence just as on a k-long sentence: you take each token and compute the projection (since each projection is simply multiplying the individual token by a matrix). – A. Maman Oct 31 '22 at 09:15
  • (2) Because the projections are computed at the token level, there is no problem computing them on 16 tokens or on 512 tokens (it's simply taking the sentence matrix of dimensions $16\times d_{emb}$ and multiplying it with $W_q$, for instance, which is $d_{emb}\times d_q$). The result in the first case will be $16\times d_q$ and in the second case $512\times d_q$. But that is not a problem since the attention is computed as $\mathrm{softmax}(Q\times K^T)\times V$, so the dimensions of the activation matrix will be $16\times 16$, which you multiply by $V$ ($16\times d_v$). – A. Maman Oct 31 '22 at 09:16
  • (3) The whole computation will result in a $16\times d_v$ matrix, or $n\times d_v$ in general for variable $n$. So when computing attention on an n-long sentence you get an n-long sequence of contextualized representations, which is effectively what happens when you truncate -> pad -> mask. The point is that the loss is eventually computed only on the non-masked tokens, so I'm asking why we do this padding (apart from implementation/computation considerations) and don't simply compute with dynamically changing input-output lengths. – A. Maman Oct 31 '22 at 09:16
  • Don't forget we're not processing one sentence at a time, but batches of sentences, so the "whole computation" results in a $batchsize \times n \times d_{v}$ matrix. You can't have variable lengths of $n$ in this batch matrix. Additionally, once you tell your model to expect a fixed value of $n$ here, you have to provide predictors of the same length, otherwise you'll get tensor shape errors (see the sketch after these comments). – waxalas Oct 31 '22 at 22:20
  • I believe wav2vec is a variable length transformer? – Tom Huntington Nov 08 '22 at 22:09
  • This answers the question but doesn't really resolve the confusion. If we limit the context length to 1024 for performance reasons, why do we always talk about context length as though it's a hard parameter of the architecture? For example, you wouldn't talk about batch size as though it's an architectural parameter. The higher the batch size, the more memory you'll need, sure - but you wouldn't say "hey guys, this architecture *cannot support a batch of more than 128 sentences*. Its *max batch size* is 128". That would be ridiculous. – Jack M Aug 17 '23 at 10:40
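A quick sketch (with arbitrary dimensions) of the two points debated in the comments above: the per-token weights work for any sentence length on its own, but a batch has to be a single rectangular tensor, which is what forces one length.

```python
import torch

d_emb, d_q = 64, 32
W_q = torch.randn(d_emb, d_q)               # projection weights: shape independent of sequence length

a = torch.randn(16, d_emb)                  # a 16-token sentence
b = torch.randn(512, d_emb)                 # a 512-token sentence
print((a @ W_q).shape, (b @ W_q).shape)     # (16, 32) and (512, 32): each works on its own

# ...but stacking sentences of different lengths into one batch tensor fails without padding:
try:
    torch.stack([a, b])
except RuntimeError as err:
    print("stack failed:", err)
```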

To add something to pip.pip's answer (thumbs up, because it is totally on point): consider that transformers can't be fully convolutional since, as the name suggests, a fully convolutional model performs only convolutions, while transformers include dense layers, which expect a fixed input dimension.

Although it is possible to overcome the fixed dimensionality imposed by the dense layers, for example by using pyramid pooling, that would only add complexity to the training regime, and there is no guarantee that performance would improve.
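For reference, here is a rough sketch of the pyramid-pooling idea adapted to 1-D token sequences (the `pyramid_pool` helper is just an illustration, not something I would necessarily recommend, as said above): multi-scale pooling maps a variable-length sequence to a fixed-size vector.

```python
import torch
import torch.nn.functional as F

def pyramid_pool(x, levels=(1, 2, 4)):
    # x: (n, d) token features for any n; max-pool the sequence into 1, 2 and 4
    # bins and concatenate, giving a vector of size d * (1 + 2 + 4) regardless of n.
    pooled = [F.adaptive_max_pool1d(x.T.unsqueeze(0), level).flatten() for level in levels]
    return torch.cat(pooled)

print(pyramid_pool(torch.randn(16, 64)).shape)   # torch.Size([448])
print(pyramid_pool(torch.randn(512, 64)).shape)  # torch.Size([448])
```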

Lastly, from a linguistic perspective, 512 or 1024 tokens are already quite a lot for learning most long-range dependencies (which, let's recall, were the main reason transformers were introduced in place of RNNs). So the game of making transformers input-size independent is not really worth the effort.

Edoardo Guerriero
  • Hi @Edoardo Guerriero, thanks! I didn't mean to ask whether transformers can be fully convolutional or not. I meant to compare fully-convolutional networks' ability to process inputs of different resolutions with transformers' in-practice training method of processing fixed-size sequence lengths. That is: can the same instance of a transformer process an n-token input and a k-token input without padding? And I mainly ask whether this is theoretically possible in the architecture, not about memory/compute resources or the implementation perspective. Thanks again! – A. Maman Oct 27 '22 at 11:11