This question is related to "Neural network to detect 'spam'?". I'm wondering how it would be possible to handle the emotion conveyed in text. In informal writing, especially among a younger audience, it's common to find emotion expressed as repetition of characters. For example, "Hi" doesn't mean the same as "Hiiiiiiiiiiiiiii", whereas "hiiiiii", "hiiiiiiiii", and "hiiiiiiiiii" all mean roughly the same thing as each other.
A naive solution would be to preprocess the input and trim runs of a repeated character once they pass a certain threshold, say, 4. This would reduce most long variants such as "hiiiiiii" to "hiiii" (four i's), giving "hi" and "long hi" separate meanings (or separate weights in a context?).
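To make the thresholding idea concrete, here is a minimal sketch using Python's standard `re` module. The function name `squeeze_repeats` and the `max_run` parameter are names I made up for illustration, and the threshold of 4 is just the arbitrary value from above.

```python
import re

def squeeze_repeats(text: str, max_run: int = 4) -> str:
    """Trim any run of one repeated character down to at most max_run copies."""
    # (.)\1{max_run,} matches a character followed by max_run or more copies
    # of itself, i.e. a run that is strictly longer than max_run.
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)

print(squeeze_repeats("Hi"))                # short form is left alone
print(squeeze_repeats("hiiiiii"))           # long forms collapse to one token
print(squeeze_repeats("hiiiiiiiiii"))
```

With this, "hi" stays "hi" while every long variant becomes the same "hiiii" token, so a model would only ever see two forms.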
The naivete of this solution shows when the repetition is a combination of characters rather than a single one. For example, "haha" vs "hahahahaha", or "lol" vs "lololololol". Again, we could write a regex to reduce lolol(ol)+ to lolol. But then we run into cases like "hahahaahhaaha", where a typo breaks the sequence.
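A rough sketch of both cases, again with Python's `re` module. The names `collapse_units` and `normalize_laugh` are hypothetical, and so are the choices of two-character units, keeping 3 units, the length cutoff of 7, and the "hahaha" replacement token. The second function is only a heuristic I am assuming for the typo case: it ignores the order of the letters and looks at which letters the token is made of.

```python
import re

def collapse_units(text: str, keep: int = 3) -> str:
    """Trim a repeated two-character unit (ha, lo, ...) down to `keep` copies."""
    # Matches a unit followed by `keep` or more copies of itself.
    pattern = re.compile(r"(\w\w)\1{%d,}" % keep)
    return pattern.sub(lambda m: m.group(1) * keep, text)

def normalize_laugh(token: str) -> str:
    """Heuristic: map any long token made only of h and a to one laugh token."""
    lowered = token.lower()
    looks_like_laugh = (
        re.fullmatch(r"[ha]{7,}", lowered) is not None
        and lowered.count("h") >= 2
        and lowered.count("a") >= 2
    )
    return "hahaha" if looks_like_laugh else token

print(collapse_units("hahahahaha"))      # exact repetition is trimmed
print(collapse_units("hahahaahhaaha"))   # the typo defeats the regex
print(normalize_laugh("hahahaahhaaha"))  # the letter-set heuristic still catches it
```

The obvious downside of the second function is that it needs one hand-written rule per kind of laugh, so it probably doesn't scale beyond a handful of common cases.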
There is also the whole issue of Emoji. Emoji may seem daunting at first since they are special characters. But once understood, emoji may actually become helpful in this situation. For example, may mean a very different thing than , but may mean the same as and .
The trick with emojis, to me, is that they might actually be easier to parse. Simply add spaces between to convert to in the text analysis. I would guess that repetition would play a role in training, but unlike "hi" and "hiiii", Word2Vec won't try to categorize and as different words (as I've now forced them to be separate words, relying on frequency to detect the emotion of the phrase).
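Here is a minimal sketch of that spacing step in plain Python. The function name `space_out_emoji` is made up, and the code point ranges are a rough assumption that covers many common emoji but not all of them. It would also split multi-code-point emoji (skin tone modifiers, ZWJ sequences), so a dedicated emoji library would likely do better.

```python
import re

# Assumption: a rough approximation of "emoji", not the full Unicode definition.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def space_out_emoji(text: str) -> str:
    """Surround each emoji with spaces so a tokenizer sees it as its own word."""
    spaced = EMOJI.sub(lambda m: f" {m.group(0)} ", text)
    # Collapse the extra whitespace introduced by the substitution.
    return " ".join(spaced.split())

# Three identical emoji glued to a word become three separate tokens.
tokens = space_out_emoji("hi\U0001F600\U0001F600\U0001F600").split()
print(len(tokens))
```

After this step, a repeated emoji is just the same token appearing several times, so the repetition shows up as frequency rather than as a new vocabulary entry.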
Even more, this would help the detection of "playful" language such as , where the emoji might imply there is anger, but alongside and especially when repeating multiple times, it would be easier for a neural network to understand that the person isn't really angry.
Does any of this make sense, or am I going in the wrong direction?