
This is a follow-up to the question "Neural network to detect spam?". I'm wondering how it would be possible to handle the emotion conveyed in text. In informal writing, especially among a juvenile audience, it's common to find emotion expressed through repeated characters. For example, "Hi" doesn't mean the same as "Hiiiiiiiiiiiiiii", but "hiiiiii", "hiiiiiiiii", and "hiiiiiiiiii" all do.

A naive solution would be to preprocess the input and truncate repeated characters beyond a certain threshold, say, 4. This would reduce most long variants of "hiiiii" to "hiiii", giving a separate meaning (weight in a context?) to "hi" vs. "long hi".
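The truncation idea above can be sketched in a few lines with a backreference regex (a minimal sketch; the function name and threshold default are my own choices, not from the question):

```python
import re

def truncate_repeats(text, threshold=4):
    """Collapse any character repeated more than `threshold` times
    in a row down to exactly `threshold` occurrences."""
    # (.)\1{threshold-1,} matches a character followed by at least
    # threshold-1 further copies, i.e. a run of threshold or more.
    pattern = r"(.)\1{%d,}" % (threshold - 1)
    return re.sub(pattern, r"\1" * threshold, text)
```

So `truncate_repeats("Hiiiiiiiiiiiiiii")` gives `"Hiiii"`, while `"hi"` passes through untouched; it also handles runs of punctuation such as `"Hi!!!!!!!!"`.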

The naivete of this solution shows when there are combinations, for example "haha" vs. "hahahahaha", or "lol" vs. "lololololol". Again, we could write a regex to reduce lolol[ol]+ to lolol. But then we run into cases like "hahahaahhaaha", where a typo broke the sequence.
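A multi-character variant of the same trick uses a lazily matched group plus a backreference (a sketch under my own assumptions about unit length; note it collapses to two copies of the unit, and, as the question points out, a typo-broken run like "hahahaahhaaha" will only be partially collapsed):

```python
import re

def collapse_repeated_ngrams(text):
    """Collapse a short unit repeated three or more times back-to-back
    (e.g. "hahahaha", "lololol") down to two copies of the unit."""
    # (\w{2,3}?) lazily captures a 2-3 character unit; \1{2,} requires
    # at least two further consecutive copies of that same unit.
    return re.sub(r"(\w{2,3}?)\1{2,}", lambda m: m.group(1) * 2, text)
```

For example, `"hahahahaha"` becomes `"haha"`, and `"lololololol"` becomes `"lolol"` (the trailing odd "l" survives the collapse).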

There is also the whole issue of emoji. Emoji may seem daunting at first since they are special characters, but once understood, they may actually become helpful in this situation. For example, one emoji may mean a very different thing than another, while several distinct emoji may mean essentially the same thing.

The trick with emoji, to me, is that they might actually be easier to parse: simply add spaces between them during text analysis. I would guess that repetition would play a role in training, but unlike "hi" and "hiiii", Word2Vec won't try to categorize a repeated emoji sequence as a different word (since I've now forced the emoji to be separate words, relying on frequency to detect the emotion of the phrase).
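The "add spaces between emoji" step might look like this (a sketch; the codepoint ranges below are a rough assumption covering common emoji blocks, not the full Unicode emoji set, for which real code should consult Unicode's emoji-data tables):

```python
import re

# Rough emoji matcher: a few common Unicode emoji blocks.
# Deliberately incomplete -- an assumption for this sketch.
EMOJI_RE = re.compile("([\U0001F300-\U0001FAFF\u2600-\u27BF])")

def space_out_emoji(text):
    """Put whitespace around every emoji so a plain whitespace
    tokenizer feeds each emoji to Word2Vec as its own token."""
    spaced = EMOJI_RE.sub(r" \1 ", text)
    # Collapse the doubled spaces this introduces between emoji.
    return re.sub(r"\s+", " ", spaced).strip()
```

After this, a run of three identical emoji becomes three occurrences of the same single-emoji token rather than one long unseen token.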

Even more, this would help with the detection of "playful" language, where an emoji might imply there is anger, but alongside other emoji, and especially when repeated multiple times, it would be easier for a neural network to understand that the person isn't really angry.

Does any of this make sense, or am I going in the wrong direction?

hjf

2 Answers


These kinds of repetitions in text place demands on learning algorithms that may or may not be handled well without special encoding.

  • Hi.
  • Hiiii!
  • HIIIIIIIII
  • Hi!!!!!!!!!!!!!!

These have the same meaning on one level, but they carry different emotional content and therefore different correlations to categories when detecting the value of an email, which in the simplest case is the placement of a message in one of two categories:

  • Pass to a recipient
  • Archive only

This is colloquially called spam detection, although not all useless emails are spam, and some messages sent by organizations that broadcast spam may be useful, so technically the term spam is not particularly precise. The determinant should usually be the return on investment to the recipient or to the organization receiving and categorizing the message.

Is reading the message and potentially responding likely of greater value than the cost of reading it?

That is a high-level paraphrase of what the value or cost function must represent when AI components are employed to learn, or (in continuous learning) to track, some business or personal optimum.

The question proposes a normalization scheme that truncates long repetitions of short patterns in characters, but truncation is necessarily destructive. Compression of some type that will both preserve nuance and work with the author's use of Word2Vec is a more flexible and comprehensive approach.

In the case of playful sequences of characters, it is anthropomorphic to imagine that an artificial network will understand playfulness or anger; however, existing learning systems can certainly learn to use character sequences that humans would call playful or angry in the function that emerges to categorize the messages containing them. Just remember that model-free learning is not at all like cognition, so the term understanding places an expectation on the mental capacities of the AI component that it may not possess.

Since there is no indication that a recurrent or recursive network will be used, but rather that the entire message is represented as a fixed-width vector, the question becomes which of these two approaches will produce the best outcomes after learning.

  • Leaving the text uncompressed, so that an 'H' character followed by ten 'i' characters is a distinct word from an 'H' character followed by five 'i' characters
  • Compressing the text to "Hi [9xi]" and "Hi [4xi]" respectively, or some such word bifurcation
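The second, compression-style option might be sketched like this (the `[NxC]` marker follows the answer's own "Hi [9xi]" example; the function name and run-length threshold are my assumptions):

```python
import re

def compress_repeats(text, threshold=2):
    """Rewrite runs of a repeated word character as the character
    plus a count marker, e.g. "H" + ten "i"s -> "Hi [9xi]"."""
    def repl(match):
        char, run = match.group(1), match.group(0)
        # One copy of the character stays in the word; the marker
        # records how many extra copies were removed.
        return "%s [%dx%s]" % (char, len(run) - 1, char)
    # (\w)\1{threshold,} matches runs longer than threshold+1 chars.
    return re.sub(r"(\w)\1{%d,}" % threshold, repl, text)
```

This keeps "Hi" itself untouched while mapping the ten-'i' and five-'i' variants to "Hi [9xi]" and "Hi [4xi]", preserving the repetition count as a separate token instead of destroying it.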

This second approach produces reasonable behavior with the other cases mentioned, such as a repeated emoji pre-processed into a "[2x]"-style token. What the algorithm in Word2Vec will do with each of these two choices, and how its handling of them will affect outcomes, is difficult to predict; experiments must be run. Three courses of action are advisable:

  • Build a test fixture to allow quick evaluation of outcomes for various trials.
  • Experiment with diligence. Don't leave any potentially interesting case untried.
  • Label as much production data as reasonably possible and use that as well as the canned data so that the above options can be evaluated in permuted combinations with the differences in pattern distribution between the canned and live data.
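A test fixture of the kind suggested above could start as small as this (a sketch under my own assumptions; here it only compares the vocabulary each preprocessing scheme would hand to Word2Vec, but the same shape extends to full train-and-evaluate trials):

```python
import re

def evaluate_schemes(messages, schemes):
    """Minimal test fixture: run each candidate preprocessing scheme
    over a corpus and report the resulting vocabulary size, for a
    quick side-by-side comparison of trials."""
    results = {}
    for name, preprocess in schemes.items():
        vocab = set()
        for message in messages:
            vocab.update(preprocess(message).split())
        results[name] = len(vocab)
    return results

# Example trial: raw text vs. repeated-character truncation.
corpus = ["hi", "hiiii there", "hiiiiiiii there"]
schemes = {
    "raw": lambda t: t,
    "truncated": lambda t: re.sub(r"(.)\1{3,}", r"\1\1\1\1", t),
}
```

Here truncation merges "hiiiiiiii" and "hiiii" into one vocabulary entry, so `evaluate_schemes(corpus, schemes)` reports a smaller vocabulary for the truncated scheme; swapping in labeled production data and a real training loop turns this into the evaluation harness the bullet points describe.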
Douglas Daseeco

My sense is that this would require a statistical approach with a large dataset.

The algorithm would need to "translate" the slang into formal terms (discrete words or phrases expressing a single concept.)

The trick would be vetting the algorithm's decisions, which would require a sufficient sample of humans to evaluate the given translation for each novel instance of slang. (This would likely require some form of crowdsourcing, similar to CAPTCHA.)

This would determine whether 4×, 5×, and 6× repetitions of a symbol are equivalent, and whether the spacing between emojis is meaningful.

Most likely these would be fuzzy associations in that the same slang can be interpreted differently by different people, and the meaning can vary when used in different contexts:

The same repeated symbol could mean "I'm laughing super-hard because what you say is so absurdly incorrect." [Adversarial]

Or it could mean "I'm laughing super-hard because the joke is extremely funny." [Cooperative]

Informally, my experience of the 5× repetition has always been adversarial, but that could be a function of the contexts in which I've encountered it, which reinforces the need for large samples.

It occurs to me that you could reduce the sample size by using a friendly chatbot that parses social media posts for any symbolic information that is non-standard, then queries the posters, asking for clarification. (This way, you'd get the intent of the slang from the person using it, as opposed to the interpretations of those viewing it.)

For informal text (as opposed to emojis), the algorithm would need to be able to distinguish between intentional and unintentional misspellings.
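One toy heuristic for that distinction (entirely my own assumption, not from the answer): a misspelling is likely intentional elongation if collapsing its letter runs yields a known word, and unintentional otherwise.

```python
import re

# Stand-in for a real dictionary; an assumption for this sketch.
KNOWN_WORDS = {"hi", "cool", "no"}

def classify_misspelling(token):
    """Label a token as a correct word, an intentional elongation
    (letter runs collapse to a known word), or a likely typo."""
    t = token.lower()
    if t in KNOWN_WORDS:
        return "correct"
    shrunk = {
        re.sub(r"(.)\1+", r"\1", t),     # collapse runs to 1 copy
        re.sub(r"(.)\1+", r"\1\1", t),   # collapse runs to 2 copies
    }
    if shrunk & KNOWN_WORDS:
        return "intentional"    # e.g. "hiiiii", "cooool"
    return "unintentional"      # e.g. "hte"
```

Collapsing to both one and two copies is needed so that "cooool" can recover "cool"; a transposition typo like "hte" matches neither and falls through to "unintentional".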

DukeZhou