
When fine-tuning a decoder-only LLM like LLaMA on a chat dataset, what kind of padding should one use?

Many papers use left padding, but is right padding wrong? Transformers emits the following warning when right padding is used: "A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer."

The attention mask will ignore the padding tokens anyway.
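For reference, this is roughly what the warning asks for at generation time; a minimal sketch, where the checkpoint name is a placeholder and reusing EOS as the pad token is my own assumption (LLaMA tokenizers ship without a pad token):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "my-llama-checkpoint"  # placeholder, not a real checkpoint

# Pad on the left, as the warning suggests for decoder-only models.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as the pad token

model = AutoModelForCausalLM.from_pretrained(model_name)

# With left padding, every sequence in the batch ends with its last real token,
# so generation continues from the prompt rather than from a pad position.
prompts = ["I love apple", "The quick brown fox jumps over"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```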


2 Answers


I'm concerned about this as well: the generate() method in the transformers library explicitly suggests that decoder-only models should use left padding. Is there some connection? I would also like to know the reason for the left padding.


I found an answer to this question, which is probably the correct explanation.

In a decoder-only architecture, the model's output is a continuation of its input.

For example, with right padding the input is: I love apple [pad] [pad]. The model's output contains the input followed by the newly generated tokens.

For example, the output: I love apple [pad] [pad], because it is delicious.

This leaves the [pad] tokens stuck in the middle of the text, which is very bad for the model, since the continuation no longer directly follows the prompt. If we use left padding, the output of the model will be

output: [pad] [pad] I love apple, because it is delicious.

Here the semantic content stays contiguous, and the generated text directly continues the prompt.

Reference: https://github.com/huggingface/transformers/issues/18388#issuecomment-1204369688
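To see the difference concretely without running generation, one can simply tokenize a batch with both padding sides; a small sketch, using GPT-2 purely as a stand-in decoder-only tokenizer (an assumption, any causal LM tokenizer would do):

```python
from transformers import AutoTokenizer

batch = ["I love apple", "I love apple because it is delicious"]

for side in ("right", "left"):
    tok = AutoTokenizer.from_pretrained("gpt2", padding_side=side)
    tok.pad_token = tok.eos_token  # GPT-2 has no pad token, so reuse EOS

    # With right padding, the shorter prompt ends in pad tokens, so generated
    # tokens would be appended after the pads; with left padding, the pads sit
    # at the front and every prompt ends with its last real token.
    ids = tok(batch, padding=True)["input_ids"]
    print(side, ids)
```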