Xu et al. (2022) distinguish between the following popular pre-training methods for language modeling (see Section 2.1, PRETRAINING METHODS):
- Left-to-Right:
Auto-regressive, left-to-right models predict the probability of a token given the previous tokens.
- Encoder-Decoder:
An encoder-decoder model first uses an encoder to encode an input sequence, and then uses a left-to-right LM to decode an output sequence conditioned on the input sequence.
My question is: what are the differences between these two methods? Do they suggest that the first method is decoder-only? If so, what is the input to this decoder?
Based on what I know about auto-regressive models and the above definition, I understand that in Left-to-Right we predict the $i$-th token given tokens $1, \dots, i-1$ (which could be our own past predictions), as in the factorization I sketch below.
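To make the comparison concrete, this is the factorization I have in mind (my own notation, not the paper's). A left-to-right model assigns a probability to a single sequence $x_1, \dots, x_n$ by conditioning each token on the previous tokens of the same sequence:

$$p_\theta(x_1, \dots, x_n) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \dots, x_{i-1})$$

whereas, as I read the encoder-decoder definition, the decoder predicts an output sequence $y_1, \dots, y_m$ left-to-right while additionally conditioning on the encoded input sequence $x$:

$$p_\theta(y_1, \dots, y_m \mid x) = \prod_{i=1}^{m} p_\theta(y_i \mid y_1, \dots, y_{i-1}, x)$$

Under this reading, the structural difference would be whether there is a separate input sequence $x$ that the decoder conditions on. Is that the right way to understand the distinction, and does the first case then make the model decoder-only?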