
Can the decoder in a transformer model be parallelized like the encoder?

As far as I understand, the encoder has all the tokens in the sequence available to compute the self-attention scores. But for a decoder this is not possible (in both training and testing), as self-attention is calculated based on previous-timestep outputs. Even if we consider techniques like teacher forcing, where we concatenate the expected output with the obtained one, this still involves sequential input from the previous timestep.

In this case, apart from the improvement in capturing long-term dependencies, is using a transformer decoder better than, say, an LSTM, when comparing purely on the basis of parallelization?

nbro
shiredude95

3 Answers


Can the decoder in a transformer model be parallelized like the encoder?

Generally NO:

Your understanding is completely right. In the decoder, the output of each step is fed to the bottom decoder in the next time step, just like an LSTM.

Also, as in LSTMs, the self-attention layer needs to attend to earlier positions in the output sequence in order to compute the output, which makes straightforward parallelisation impossible.

However, when decoding during training, there is a frequently used procedure which doesn't take the previous output of the model at step t as input at step t+1, but rather takes the ground-truth output at step t. This procedure is called 'teacher forcing' and allows the decoder to be parallelised during training. You can read more about it here.

For a detailed explanation of how the Transformer works, I suggest reading this article: The Illustrated Transformer.
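
To make the "parallel during training" point concrete, here is a minimal sketch (PyTorch, purely illustrative and not taken from any particular implementation) of masked self-attention: with teacher forcing, the whole ground-truth target sequence is fed in at once, and a lower-triangular mask prevents each position from attending to later positions, so all positions are computed in a single pass.

    import torch
    import torch.nn.functional as F

    def causal_self_attention(x, w_q, w_k, w_v):
        # x: (seq_len, d_model) -- the whole ground-truth target sequence at once
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = (q @ k.T) / k.shape[-1] ** 0.5            # (seq_len, seq_len)
        mask = torch.tril(torch.ones_like(scores)).bool()  # no attending to future positions
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v               # every position computed in parallel

    seq_len, d_model = 5, 8
    x = torch.randn(seq_len, d_model)
    w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
    out = causal_self_attention(x, w_q, w_k, w_v)          # shape (5, 8), one pass, no loop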

Is using a transformer decoder better than, say, an LSTM, when comparing purely on the basis of parallelization?

YES:

Parallelization is the main drawback of RNNs in general. Put simply, RNNs have the ability to memorize but not to parallelize, while CNNs have the opposite. Transformers are so powerful because they combine both parallelization (at least partially) and memorization.

In natural language processing, for example, where RNNs used to be so effective, if you take a look at the GLUE leaderboard you will find that most of the world-leading models today are Transformer-based (e.g. BERT by Google, GPT by OpenAI).

For a better understanding of why Transformers are better than CNNs, I suggest reading this Medium article: How Transformers Work.

HLeb
  • This answer is misleading since it does not mention the fact that the decoder part of a Transformer is parallelizable during training. – Mathias Müller Jan 29 '20 at 14:28
  • 2
    Thanks for the note @MathiasMüller. However, while the decoder can be parallelized during training using the 'already known' trick, to my knowledge this will not have the same result because you can replace by a word that isn't the same as the one the decoder will predict. And this will affect the model differently during back-propagation. So my answer concerns the general understanding of transformer decoder and actual ability to parallelize without a trick that will affect the model. Please clarify if I've mistaken smthg. – HLeb May 06 '20 at 13:24
  • 1
    No, this is not a trick that changes the training procedure: all implementations of standard Transformers compute all positions in the same layer in parallel (for both encoder and decoder during training, for encoder during translation). This does not affect the model at all: the results for each position are mathematically independent. – Mathias Müller May 06 '20 at 14:03
  • 4
    ... In case you meant that during training, actual predictions by the model are not used to build up the target sequence: this is also not a trick I would say, but a standard procedure called "teacher forcing" that is used in virtually all supervised sequence prediction models. – Mathias Müller May 06 '20 at 15:34
  • 1
    Thanks @MathiasMüller for the clarification. I edited the answer to include that. – HLeb Apr 14 '21 at 07:38

Can the decoder in a transformer model be parallelized like the encoder?

The correct answer is: computation in a Transformer decoder can be parallelized during training, but not during actual translation (or, in a wider sense, generating output sequences for new input sequences during a testing phase).

What exactly is parallelized?

Also, it's worth mentioning that "parallelization" in this case means computing encoder or decoder states in parallel for all positions of the input sequence. Parallelization over several layers is not possible: the first layer of a multi-layer encoder or decoder still needs to finish computing all positions in parallel before the second layer can start computing.
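
As a rough sketch of what this means (illustrative Python/PyTorch; the plain linear layers below are only stand-ins for real decoder layers), positions are processed together inside each layer, while the layers themselves still run one after another:

    import torch

    def run_stack(layers, states):
        # states: (seq_len, d_model) -- all positions at once
        for layer in layers:        # layers are inherently sequential
            states = layer(states)  # ...but each layer processes every position in one call
        return states

    d_model, seq_len, n_layers = 16, 7, 4
    layers = [torch.nn.Linear(d_model, d_model) for _ in range(n_layers)]  # stand-ins for decoder layers
    out = run_stack(layers, torch.randn(seq_len, d_model))                 # shape (7, 16)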

Why can the decoder be parallelized position-wise during training?

For each position in the input sequence, a Transformer decoder produces a decoder state as an output. (The decoder state is then used to eventually predict a token in the target sequence.)

In order to compute one decoder state for a particular position in the sequence of states, the network consumes as inputs: 1) the entire input sequence and 2) the target words that were generated previously.

During training, the target words generated previously are known, since they are taken from the target side of our parallel training data. This is the reason why computation can be factored over positions.

During inference (also called "testing", or "translation"), the target words previously generated are predicted by the model, and computing decoder states must be performed sequentially for this reason.
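
The difference between the two regimes can be sketched like this (assuming a hypothetical decoder callable that maps a source sequence and a target prefix to one output distribution per target position; all names here are illustrative, not any library's actual API):

    import torch
    import torch.nn.functional as F

    def training_step(decoder, src, tgt):
        # Teacher forcing: the target prefix is known, so one call covers all positions.
        logits = decoder(src, tgt[:-1])             # single parallel pass over positions
        return F.cross_entropy(logits, tgt[1:])     # tgt: 1-D tensor of token ids

    def greedy_translate(decoder, src, bos_id, eos_id, max_len=50):
        # Inference: each new token depends on the model's own previous predictions,
        # so this loop cannot be parallelized over positions.
        out = [bos_id]
        for _ in range(max_len):
            logits = decoder(src, torch.tensor(out))
            next_id = int(logits[-1].argmax())
            out.append(next_id)
            if next_id == eos_id:
                break
        return out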

Comparison to RNN models

While Transformers can parallelize over input positions during training, an encoder-decoder model based on RNNs cannot parallelize positions. This means that Transformers are generally faster to train, while RNNs are faster for inference.

This observation leads to the nowadays common practice of training Transformer models and then using sequence-level distillation to learn an RNN model that mimics the trained Transformer, for faster inference.
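
For contrast, here is a minimal illustration (again just a sketch, not code from any real system) of why an RNN decoder cannot be parallelized over positions even during training: the hidden state at step t depends on the hidden state at step t-1, so the loop remains even when all ground-truth inputs are known in advance.

    import torch

    rnn_cell = torch.nn.RNNCell(input_size=8, hidden_size=8)
    targets = torch.randn(5, 8)          # ground-truth target embeddings, all known in advance
    h = torch.zeros(1, 8)
    for t in range(targets.shape[0]):    # still sequential: h depends on the previous h
        h = rnn_cell(targets[t].unsqueeze(0), h)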

Mathias Müller
  • It seems your definition of "parallelization" is very different from the actual definition of parallelization. We say MLPs and CNNs are parallelizable because the ith output node produces its output independently of the other output nodes. That doesn't happen in an RNN/LSTM/decoder of a transformer. Hence the architecture cannot produce parallel outputs, hence is not parallelizable. Data/batches can still be fed to such architectures in parallel, but that is technically "data parallelization", not parallelization. – Ritwik Nov 12 '20 at 10:35
  • 1
    @Ritwik As I explain in my answer, all elements in an output sequence inside a particular layer of a Transformer decoder are produced in parallel **during training, but not during sequence generation**. What is your "actual definition of parallelization"? – Mathias Müller Nov 12 '20 at 12:38
  • Isn't sequence generation part of the training process only? And an architecture is "parallelizable" if it generates outputs by simple mathematical operations like matmul, max, min, pool, etc. In the RNN/decoder of a transformer an additional loop is needed for each output word; the next output depends on the previous output, hence not parallelizable. I will share the reference for this in some time. – Ritwik Nov 12 '20 at 16:40
  • @Ritwik "isn't sequence generation part of the training process only?" then what do you call producing a sequence after training is finished? "and architecture is "parallelizable" if it generates outputs by simple mathematical operations like matmul, max, min, pool, etc." - no, that is not an accurate definition of parallelizable. Parallelizable in the context of Transformers here means that a certain operation can be run independently for each item in a sequence. It does not matter at all what kind of mathematical operation is performed. – Mathias Müller Nov 12 '20 at 16:57
  • "RNN/decoder of transformer an additional loop is needed for each output word, next output depends on previous output, hence not parallelizable." - yes, but for a Transformer decoder that is _not_ true during training. – Mathias Müller Nov 12 '20 at 16:58
  • If sequence generation is _not_ part of training, then how exactly is the loss calculated in the decoder (while training)? – Ritwik Nov 13 '20 at 06:38
  • @Ritwik What I meant is: yes, you can think of sequence generation to _also_ happen during training, but during training a decoder is usually not fed its own previous predictions when it has to generate the output distribution for the next time step. Here: "during training, but not during sequence generation" I use "sequence generation" in the sense of "producing sequences after training is finished". – Mathias Müller Nov 13 '20 at 07:27
  • quoting directly from the paper "Given z, the decoder then generates an output sequence of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next." The paper doesn't differentiate between the training and testing paradigms of the decoder. – Ritwik Nov 13 '20 at 09:39
  • @Ritwik If you are referring to the paper "Attention is All You Need": even if in this particular sentence, there is no distinction between training and testing, this does not mean that they are identical. In most cases, Transformer decoders are trained with a technique called "teacher forcing" that does not feed the decoder its own previous outputs during training. You are claiming that this is not true without any evidence. – Mathias Müller Nov 13 '20 at 11:07
  • @Ritwik "Attention is All You Need" definitely uses teacher forcing in training, even though it's not obvious from the sentence you quoted. For example, take a look at https://github.com/tensorflow/tensor2tensor/issues/695 (I know it doesn't prove that's what the original paper did, but this is the codebase referenced in the paper). – max Mar 18 '21 at 07:08
  • I'm still confused about parallelization. As I understand teacher forcing, the ground truth is used instead of the model prediction, but still sequentially, or? Let's say we want to translate "I'm a dog" to "Ich bin Hund". As I imagine it, the encoder processes "I'm a dog" at once and provides a representation to the decoder part. The decoder takes the start token and produces something, then I feed the decoder the start token and the "Ich" token (teacher forcing, using the truth instead of the output), and so on. But your description sounds like the decoder creates "Ich bin Hund" in one run. Can you guide me on what I'm missing? :) – viceriel May 19 '21 at 21:34
  • 1
    @viceriel The crucial bit is that there is a difference between a) training and b) using a trained model for translation after training. During a) the decoder side can be parallelized more than during b). During a) yes, all positions of the target sentence can be processed in parallel. – Mathias Müller May 20 '21 at 07:42

Can't see that this has been mentioned yet - there are ways to generate text non-sequentially using a non-autoregressive transformer, where you produce the entire response to the context at once. This typically produces worse accuracy scores because there are interdependencies within the text being produced - a model translating "thank you" could say "vielen Dank" or "danke schön", but whereas an autoregressive model knows which word to say next based on previous decoding, a non-autoregressive model can't do this, so it could also produce "danke Dank" or "vielen schön". There is some research suggesting you can close the accuracy gap, though: https://arxiv.org/abs/2012.15833
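
A rough sketch of the idea (the `nat_model` and `length_predictor` callables below are hypothetical, purely illustrative names): predict a target length, then fill every position in a single parallel pass, which is exactly why repeats like "danke Dank" can appear, since the positions cannot see each other's choices.

    def non_autoregressive_translate(nat_model, length_predictor, src):
        tgt_len = int(length_predictor(src))    # guess the output length up front
        logits = nat_model(src, tgt_len)        # assumed (tgt_len, vocab) tensor: all positions in one pass
        # Each position is decoded independently, so adjacent positions may pick
        # redundant words -- the interdependency problem described above.
        return logits.argmax(dim=-1).tolist()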

Ben
  • Wouldn't positional encoding help it produce the output in the right order? Or would it merely help, but not guarantee the right order? – Kari Feb 27 '21 at 11:44
  • 1
    The problem is that you're decoding everything in parallel, so even if you apply positional encoding at the beginning, the token at position 1 needs to know what the token at position 2 is going to predict or vice versa, because in the above example "danke" would be appropriate both in the first and second position. So although it is helpful, it doesn't completely solve the issue. – Ben Mar 10 '21 at 15:59
  • It seems my rep is too low to edit directly, so: actually "vielen dank" would be correct in German. This means your second example needs to be "danke dank". You could edit to avoid this minor distraction. – Mathias Müller Mar 18 '21 at 08:10