What does it mean that the decoder can be parallelized during training?
Let's assume a transformer (with both encoder and decoder) is employed for time-series prediction, i.e. from the input sequence $x_0, \dots, x_N$ we want to predict $y_0, \dots, y_N$. Is this how parallelization occurs during training?
- form the batch $[\,],\ [y_0],\ \dots,\ [y_0, \dots, y_{N-1}]$
- feed this batch to the transformer, together with the input sequence
- we obtain the batch $Y_0, Y_1, \dots, Y_N$
- compare against $y_0, \dots, y_N$ and form the loss (*) (a code sketch of these steps follows below)
(*) here, some teacher forcing ratio techniques may be employed, so that more passes may be required
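To make the question concrete, here is a minimal PyTorch sketch of the procedure I have in mind. All names (`in_proj`, `out_proj`, the learned start token standing in for the empty prefix $[\,]$) are my own, positional encodings are omitted for brevity, and `nn.Transformer` is used as the encoder-decoder:

```python
import torch
import torch.nn as nn

d_model, N = 32, 9                       # toy sizes; sequences x_0..x_N, y_0..y_N
x = torch.randn(N + 1, 1)                # input sequence  x_0, ..., x_N
y = torch.randn(N + 1, 1)                # target sequence y_0, ..., y_N

in_proj  = nn.Linear(1, d_model)         # scalar value -> d_model "embedding"
out_proj = nn.Linear(d_model, 1)         # d_model -> predicted value
start    = nn.Parameter(torch.zeros(1, d_model))   # stands in for the empty prefix []
model    = nn.Transformer(d_model=d_model, nhead=4, batch_first=True)

# --- form the batch of prefixes [], [y_0], ..., [y_0, ..., y_{N-1}] ---
y_emb = in_proj(y)                                   # (N+1, d_model)
full  = torch.cat([start, y_emb[:-1]], dim=0)        # [start, y_0, ..., y_{N-1}]
tgt = torch.zeros(N + 1, N + 1, d_model)             # prefix k lives in row k
tgt_pad_mask = torch.ones(N + 1, N + 1, dtype=torch.bool)
for k in range(N + 1):
    tgt[k, :k + 1] = full[:k + 1]                    # positions 0..k are real
    tgt_pad_mask[k, :k + 1] = False                  # the rest is padding

# --- feed this batch to the transformer, together with the input sequence ---
src = in_proj(x).unsqueeze(0).expand(N + 1, -1, -1)  # same input for every prefix
dec_out = model(src, tgt, tgt_key_padding_mask=tgt_pad_mask)  # (N+1, N+1, d_model)

# --- obtain Y_0, ..., Y_N: last non-padded position of each prefix ---
Y = out_proj(dec_out[torch.arange(N + 1), torch.arange(N + 1)])  # (N+1, 1)

# --- compare against y_0, ..., y_N and form the loss ---
loss = nn.functional.mse_loss(Y, y)
print(loss.item())
```

The start token and the padding are only there so that every prefix, including the empty one, can be stacked into a single batch of equal-length sequences.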