
Text classification of equal-length texts works without padding, but in practice texts almost never have the same length.

For example, spam filtering of comments on a blog article:

thanks for sharing    [3 tokens] --> 0 (Not spam)
this article is great [4 tokens] --> 0 (Not spam)
here's <URL>          [2 tokens] --> 1 (Spam)

Should I pad the texts on the right:

thanks for     sharing --
this   article is      great
here's URL     --      --

Or, pad on the left:

--   thanks  for    sharing
this article is     great
--   --      here's URL
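The two schemes above can be sketched with a small helper. This is a minimal illustration, not any particular library's API; the function name `pad` and the choice of `0` as the padding ID are assumptions for this example.

```python
def pad(seqs, side="right", pad_id=0):
    """Pad each token-ID sequence with pad_id up to the longest length."""
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [pad_id] * (max_len - len(s))  # padding needed for this sequence
        padded.append(s + fill if side == "right" else fill + s)
    return padded

seqs = [[5, 6, 7], [1, 2, 3, 4], [8, 9]]  # e.g. the three comments above, as token IDs
print(pad(seqs, side="right"))  # [[5, 6, 7, 0], [1, 2, 3, 4], [8, 9, 0, 0]]
print(pad(seqs, side="left"))   # [[0, 5, 6, 7], [1, 2, 3, 4], [0, 0, 8, 9]]
```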

What are the pros and cons of either pad left or right?

Dee

1 Answer

For any model that does not take a time series approach like an RNN does, the padding shouldn't make a difference.

I prefer padding on the right, simply because there may also be texts you need to truncate. Right-padding is then the more intuitive counterpart: you either cut off a text that is too long or pad one that is too short.

Either way, once a model is trained with a certain padding scheme, it shouldn't make a difference, as long as the test data is padded the same way it was during training.
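The truncate-or-pad idea can be sketched as a single function applied consistently to both training and test data. This is a generic sketch, not a specific library call; `max_len` and `pad_id=0` are assumptions.

```python
def pad_or_truncate(seq, max_len, pad_id=0):
    """Force a token-ID sequence to exactly max_len:
    cut it off if too long, right-pad it if too short."""
    if len(seq) >= max_len:
        return seq[:max_len]
    return seq + [pad_id] * (max_len - len(seq))

print(pad_or_truncate([1, 2, 3, 4, 5], 4))  # [1, 2, 3, 4]
print(pad_or_truncate([1, 2], 4))           # [1, 2, 0, 0]
```

The key point is that the same `max_len` and padding side are used at training and test time.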

N. Kiefer
  • The problem with padding on the right is that the last timestep seen by the RNN unit is almost always zero. I tackle it by returning the full sequence from the recurrent layer, so the output isn't dominated by the trailing zero padding. Is that the correct approach? – Dee Jul 03 '20 at 07:47