In my understanding, encoder-decoder transformers for translation are trained on sentence or text pairs. How can it be explained in simple (high-level) terms that decoder-only transformers (e.g. GPT) are so good at machine translation, even though they are not trained on sentence or text pairs but only on unrelated multilingual data? Why can decoder-only transformers do without such pairs? Or did I get something wrong?
Are documents in the training data that happen to contain sentence pairs near each other possibly enough?
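To make the contrast concrete, here is a toy sketch of the two training regimes I mean (the data and the prompt are hypothetical, just for illustration, not a real training setup):

```python
# Encoder-decoder MT training: explicit (source, target) sentence pairs.
parallel_corpus = [
    ("The cat sits on the mat.", "Die Katze sitzt auf der Matte."),
    ("Good morning!", "Guten Morgen!"),
]

# Decoder-only pretraining: one unlabeled multilingual text stream,
# where the model is only trained to predict the next token.
text_stream = (
    "The cat sits on the mat. ... unrelated English text ... "
    "Die Katze sitzt auf der Matte. ... unverbundener deutscher Text ..."
)

# At inference time, translation is elicited purely by prompting:
prompt = "Translate English to German:\nThe cat sits on the mat.\n"
```

So the question is how next-token prediction on something like `text_stream` ends up supporting the behavior that `prompt` asks for, without ever seeing `parallel_corpus`-style supervision.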