
Text classification of equal-length texts works without padding, but in practice texts almost never have the same length.

For example, spam filtering of comments on a blog article:

thanks for sharing    [3 tokens] --> 0 (Not spam)
this article is great [4 tokens] --> 0 (Not spam)
here's <URL>          [2 tokens] --> 1 (Spam)

Should I pad the texts on the right:

thanks for     sharing --
this   article is      great
here's URL     --      --

Or, pad on the left:

--   thanks  for    sharing
this article is     great
--   --      here's URL
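The two schemes above can be sketched with a small helper. This is a minimal illustration, not any particular library's API; the function name `pad` and the choice of `0` as the padding ID are assumptions for this example.

```python
def pad(seqs, side="right", pad_id=0):
    """Pad each token-ID sequence with pad_id up to the longest length."""
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [pad_id] * (max_len - len(s))  # padding needed for this sequence
        padded.append(s + fill if side == "right" else fill + s)
    return padded

seqs = [[5, 6, 7], [1, 2, 3, 4], [8, 9]]  # e.g. the three comments above, as token IDs
print(pad(seqs, side="right"))  # [[5, 6, 7, 0], [1, 2, 3, 4], [8, 9, 0, 0]]
print(pad(seqs, side="left"))   # [[0, 5, 6, 7], [1, 2, 3, 4], [0, 0, 8, 9]]
```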

What are the pros and cons of either pad left or right?

Dee

1 Answer

For any model that does not take a time series approach like an RNN does, the padding shouldn't make a difference.

I prefer padding on the right, simply because there may also be texts you need to truncate. Right-padding is then the more intuitive counterpart: you either cut off a text that is too long or pad one that is too short.

Either way, once a model is trained with a certain padding scheme, it shouldn't make a difference, as long as the test data is padded the same way it was during training.
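The truncate-or-pad idea can be sketched as a single function applied consistently to both training and test data. This is a generic sketch, not a specific library call; `max_len` and `pad_id=0` are assumptions.

```python
def pad_or_truncate(seq, max_len, pad_id=0):
    """Force a token-ID sequence to exactly max_len:
    cut it off if too long, right-pad it if too short."""
    if len(seq) >= max_len:
        return seq[:max_len]
    return seq + [pad_id] * (max_len - len(seq))

print(pad_or_truncate([1, 2, 3, 4, 5], 4))  # [1, 2, 3, 4]
print(pad_or_truncate([1, 2], 4))           # [1, 2, 0, 0]
```

The key point is that the same `max_len` and padding side are used at training and test time.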

N. Kiefer
  • The problem with padding on the right is that the last timestep seen by the RNN unit is almost always zero. I tackle it by returning the full sequence from the recurrent layer, so the output isn't dominated by the trailing zero padding. Is that the correct approach? – Dee Jul 03 '20 at 07:47