What does it mean that the decoder can be parallelized during training?
Let's assume a transformer (with both encoder and decoder) is employed for time-series prediction, i.e. from the input sequence $x_0, \dots, x_N$ we want to predict $y_0, \dots, y_N$. Is this how parallelization occurs during training?
- form the batch $[\,],\ [y_0],\ \dots,\ [y_0, \dots, y_{N-1}]$
- feed this batch to the transformer, together with the input sequence
- we obtain the batch $Y_0, Y_1, \dots, Y_N$
- compare against $y_0, \dots, y_N$ and form the loss (*) (a code sketch of these steps follows below)
(*) here, some teacher forcing ratio techniques may be employed, so that more passes may be required
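To make the question concrete, here is a minimal PyTorch sketch of the procedure I have in mind. All names (`in_proj`, `out_proj`, the learned start token standing in for the empty prefix $[\,]$) are my own, positional encodings are omitted for brevity, and `nn.Transformer` is used as the encoder-decoder:

```python
import torch
import torch.nn as nn

d_model, N = 32, 9                       # toy sizes; sequences x_0..x_N, y_0..y_N
x = torch.randn(N + 1, 1)                # input sequence  x_0, ..., x_N
y = torch.randn(N + 1, 1)                # target sequence y_0, ..., y_N

in_proj  = nn.Linear(1, d_model)         # scalar value -> d_model "embedding"
out_proj = nn.Linear(d_model, 1)         # d_model -> predicted value
start    = nn.Parameter(torch.zeros(1, d_model))   # stands in for the empty prefix []
model    = nn.Transformer(d_model=d_model, nhead=4, batch_first=True)

# --- form the batch of prefixes [], [y_0], ..., [y_0, ..., y_{N-1}] ---
y_emb = in_proj(y)                                   # (N+1, d_model)
full  = torch.cat([start, y_emb[:-1]], dim=0)        # [start, y_0, ..., y_{N-1}]
tgt = torch.zeros(N + 1, N + 1, d_model)             # prefix k lives in row k
tgt_pad_mask = torch.ones(N + 1, N + 1, dtype=torch.bool)
for k in range(N + 1):
    tgt[k, :k + 1] = full[:k + 1]                    # positions 0..k are real
    tgt_pad_mask[k, :k + 1] = False                  # the rest is padding

# --- feed this batch to the transformer, together with the input sequence ---
src = in_proj(x).unsqueeze(0).expand(N + 1, -1, -1)  # same input for every prefix
dec_out = model(src, tgt, tgt_key_padding_mask=tgt_pad_mask)  # (N+1, N+1, d_model)

# --- obtain Y_0, ..., Y_N: last non-padded position of each prefix ---
Y = out_proj(dec_out[torch.arange(N + 1), torch.arange(N + 1)])  # (N+1, 1)

# --- compare against y_0, ..., y_N and form the loss ---
loss = nn.functional.mse_loss(Y, y)
print(loss.item())
```

The start token and the padding are only there so that every prefix, including the empty one, can be stacked into a single batch of equal-length sequences.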