In my understanding, encoder-decoder transformers for translation are trained on sentence or text pairs. How can it be explained in simple (high-level) terms that decoder-only transformers (e.g. GPT) are so good at machine translation, even though they are not trained on sentence or text pairs but only on unrelated multilingual data? Why can decoder-only transformers do without such pairs? Or did I get something wrong?
Are documents in the training data that happen to contain sentence pairs near each other possibly enough?
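To make the contrast concrete, here is a toy sketch of the two training regimes I mean (the data and the prompt are hypothetical, just for illustration, not a real training setup):

```python
# Encoder-decoder MT training: explicit (source, target) sentence pairs.
parallel_corpus = [
    ("The cat sits on the mat.", "Die Katze sitzt auf der Matte."),
    ("Good morning!", "Guten Morgen!"),
]

# Decoder-only pretraining: one unlabeled multilingual text stream,
# where the model is only trained to predict the next token.
text_stream = (
    "The cat sits on the mat. ... unrelated English text ... "
    "Die Katze sitzt auf der Matte. ... unverbundener deutscher Text ..."
)

# At inference time, translation is elicited purely by prompting:
prompt = "Translate English to German:\nThe cat sits on the mat.\n"
```

So the question is how next-token prediction on something like `text_stream` ends up supporting the behavior that `prompt` asks for, without ever seeing `parallel_corpus`-style supervision.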