
When fine-tuning a decoder-only LLM like LLaMA on a chat dataset, what kind of padding should one use?

Many papers use left padding, but is right padding wrong? Transformers emits the following warning when right padding is used: "A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer."

The attention mask will ignore the padding tokens anyway.
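For reference, this is roughly what the warning asks for at generation time; a minimal sketch, where the checkpoint name is a placeholder and reusing EOS as the pad token is my own assumption (LLaMA tokenizers ship without a pad token):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "my-llama-checkpoint"  # placeholder, not a real checkpoint

# Pad on the left, as the warning suggests for decoder-only models.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as the pad token

model = AutoModelForCausalLM.from_pretrained(model_name)

# With left padding, every sequence in the batch ends with its last real token,
# so generation continues from the prompt rather than from a pad position.
prompts = ["I love apple", "The quick brown fox jumps over"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```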


2 Answers


I'm concerned about this as well: the generate() method in the transformers library explicitly suggests that decoder-only models should use left padding. Is there some connection? I would also like to know the reason for the left padding.


I found an answer to this question, which is probably the correct explanation.

In a decoder-only architecture, the model's output is a continuation of its input.

For example, with right padding the input is: I love apple [pad] [pad]. The model's output contains the input followed by the newly generated tokens.

For example, the output: I love apple [pad] [pad], because it is delicious.

This leaves the [pad] tokens stuck in the middle of the text, which is very bad for the model, since the continuation no longer directly follows the prompt. If we use left padding, the output of the model will be

output: [pad] [pad] I love apple, because it is delicious.

Here the semantic content stays contiguous, and the generated text directly continues the prompt.

Reference: https://github.com/huggingface/transformers/issues/18388#issuecomment-1204369688
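To see the difference concretely without running generation, one can simply tokenize a batch with both padding sides; a small sketch, using GPT-2 purely as a stand-in decoder-only tokenizer (an assumption, any causal LM tokenizer would do):

```python
from transformers import AutoTokenizer

batch = ["I love apple", "I love apple because it is delicious"]

for side in ("right", "left"):
    tok = AutoTokenizer.from_pretrained("gpt2", padding_side=side)
    tok.pad_token = tok.eos_token  # GPT-2 has no pad token, so reuse EOS

    # With right padding, the shorter prompt ends in pad tokens, so generated
    # tokens would be appended after the pads; with left padding, the pads sit
    # at the front and every prompt ends with its last real token.
    ids = tok(batch, padding=True)["input_ids"]
    print(side, ids)
```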