The Project Summarized
The project goal appears to be a common one: routing correspondence efficiently to maintain good but low cost customer and public relations. A few features of the project were mentioned.
- Neural network project
- Received some design and project history from predecessor
- Classifies messages for telcos
- Sends results to support groups at appropriate locales
- Uses two ReLU layers, ending with a softmax (sketched just after this list)
- Word2Vec embedding
- Trained with a clean language file
- All special characters and numbers removed
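For concreteness, here is a minimal sketch of the described stack, assuming Keras and a pretrained Word2Vec matrix. The vocabulary size, embedding width, and layer widths are placeholders, not values from the project.

```python
import numpy as np
from tensorflow.keras import initializers, layers, models

vocab_size, embed_dim = 20000, 300                    # placeholder sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))  # load Word2Vec vectors here

model = models.Sequential([
    layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=initializers.Constant(embedding_matrix),
        trainable=False),                    # frozen Word2Vec embedding
    layers.GlobalAveragePooling1D(),         # one vector per message
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(2, activation="softmax"),   # moderated vs. operative
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```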
The requirements for current development were indicated. The current work is to develop an artificial neural network that accurately and reliably places incoming messages into one of two categories.
- Moderated — insulting, fraudulent in purpose (spam), trivial routine
- Operative — relevant question requiring internal human attention
Research and development is beginning along reasonable lines.
- Trained with 300,000 messages
- Word2Vec used
- 40% of messages classified as moderated
- Permuted cycles and epochs
- Achieved 90% accuracy
- Loss stays near 0.5
- In test, operative accuracy 0.9, moderated accuracy max of 0.6
First Obstacle and Feasibility
The first obstacle encountered is that in QA, using production environment data, 90% of the messages were left unclassified, 5% of the classifications were accurate, and the remaining 5% were inaccurately classified.
It is correct that the even split of 5% accurate and 5% inaccurate classifications indicates that what was learned has not yet transferred to the quality assurance phase using real production messages. In information theory terms, no bits of usable information were transferred, and entropy remained unchanged in this first experiment.
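To make that phraseology concrete, the mutual information between predictions and true labels can be computed from a confusion matrix, and an even split of correct and incorrect classifications yields exactly zero bits. A small illustration with a hypothetical 2x2 confusion matrix:

```python
import numpy as np

def mutual_information(confusion):
    """Mutual information (in bits) between true labels and predictions."""
    joint = confusion / confusion.sum()
    px = joint.sum(axis=1, keepdims=True)   # true-label marginal
    py = joint.sum(axis=0, keepdims=True)   # prediction marginal
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Hypothetical even split: correct and incorrect equally likely in each class.
even_split = np.array([[25, 25],
                       [25, 25]])
print(mutual_information(even_split))  # 0.0 -- no usable bits transferred
```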
These kinds of disappointments are not uncommon when first approaching the use of AI in an existing business environment, so this initial outcome should not be taken as a sign that the idea won't work. The approach will likely work, especially with foul language, which is not dependent on cultural references, analogies, or other semantic complexity.
Recognizing notices that are for audit purposes only, from social network accounts or purchase confirmations, can be handled through rules. The rule creation and maintenance can theoretically be automated too, and some proprietary systems exist that do exactly that. Such automation can be learned using the appropriate training data, but real time feedback is usually employed, and those systems are usually model based. That is an option for further down the R&D road.
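As a small illustration of the rules idea, a handful of regular expressions held in configuration can catch the most formulaic notices. The patterns below are invented examples, not a vetted rule set:

```python
import re

# Invented example rules; a real deployment would load these from
# configuration and maintain them per locale.
AUDIT_ONLY_RULES = [
    re.compile(r"\border\s+#?\d+\s+(?:has\s+)?shipped\b", re.I),
    re.compile(r"\byour\s+purchase\s+confirmation\b", re.I),
    re.compile(r"\b(?:friend|follow|connection)\s+request\b", re.I),
]

def is_audit_only(message: str) -> bool:
    """True if the message matches a known audit-only notice pattern."""
    return any(rule.search(message) for rule in AUDIT_ONLY_RULES)

print(is_audit_only("Your purchase confirmation for order #1234"))  # True
```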
The scope of the project is probably too small, but that's not a big surprise either. Most projects suffer from early overoptimism. A pertinent quote from Redford's The Milagro Beanfield War illuminates the practical purpose of optimism.
APPARITION
I don't know if your friend knows what he's in for.
AMARANTE
Nobody would do anything if they knew what they were in for.
Initial Comments
It is not necessary to reduce the number of message categories to two, but there is nothing wrong with starting R&D by refining approach and high level design with the simplest case.
The last layer may train more efficiently if a binary threshold is used for the activation function instead of softmax, since only one bit of output is needed when there are only two categories. This also forces the network training objective to be the definitive selection of a category, which may benefit the overall rate of R&D progress.
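Since a hard threshold is not differentiable, the usual trainable stand-in is a single sigmoid unit thresholded at inference time. A minimal sketch, assuming Keras; the input width and hidden widths are placeholders:

```python
from tensorflow.keras import layers, models

# Replace the two-unit softmax with a single sigmoid unit; one bit of
# output is enough for two categories.
model = models.Sequential([
    layers.Input(shape=(300,)),             # e.g. an averaged Word2Vec vector
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(moderated)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# At inference, threshold to force a definitive category:
# category = "moderated" if model.predict(x) >= 0.5 else "operative"
```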
There may be ways of improving outcomes by adding more metrics in the code beyond just 'accuracy'. Others who work with such details every day may have more domain specific knowledge in this regard.
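For instance, per-class precision and recall expose the moderated/operative imbalance that a single accuracy number hides. A sketch using scikit-learn on held-out predictions; the label arrays are placeholders:

```python
from sklearn.metrics import classification_report

# Placeholder held-out labels and predictions; 0 = operative, 1 = moderated.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(classification_report(y_true, y_pred,
                            target_names=["operative", "moderated"]))
```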
Culture and Pattern Detection
Insults and curse words are entirely different kinds of things. Foul language is a linguistic symbol or phrase that fits into a broadcasting or publishing category of prohibition. The rules of prohibition are well established in most languages and could be held in a configuration file along with the permutations of each symbol or phrase. In the case of sh*t, related forms include sh*tty, sh*thead, and so on.
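A minimal sketch of that configuration idea follows: each base form expands into its permutations, with common character obfuscations folded in. The word list and substitution table are illustrative assumptions, not a complete rule set.

```python
import re

# Illustrative configuration: base forms and the suffixes that generate
# related forms. A real system would load this per language and locale.
FOUL_BASES = {"shit": ["", "ty", "head"]}
OBFUSCATIONS = {"i": "[i1!*]", "a": "[a4@*]", "e": "[e3*]", "o": "[o0*]"}

def build_patterns(config):
    """Compile one pattern per base-form/suffix permutation."""
    patterns = []
    for base, suffixes in config.items():
        fuzzy = "".join(OBFUSCATIONS.get(c, re.escape(c)) for c in base)
        for suffix in suffixes:
            patterns.append(re.compile(rf"\b{fuzzy}{re.escape(suffix)}\b", re.I))
    return patterns

PATTERNS = build_patterns(FOUL_BASES)
print(any(p.search("what a sh*tty day") for p in PATTERNS))  # True
```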
It is also useful to distinguish the sub-sets of foul language.
- Cursing (expressing the wish for calamity to befall the recipient)
- Swearing (considered blasphemy by some)
- Exclamations that are considered foul by publishers and broadcasters
- Additional items parents don't want their children to hear
- Edge cases like crap
The term foul language is a super-set of these.
Distribution Alignment
Learning algorithms and theory are based on probabilistic alignment of feature distributions between training and use. The distribution of the training data must closely resemble the distribution found when the trained AI component is later used. If not, the learning process may converge on behavior that is optimal as defined by the gain or loss function, yet the execution of that behavior in the business or industry may still fail.
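One cheap sanity check of that alignment is to compare token frequency distributions between the training corpus and a sample of production messages, for example with KL divergence. A rough sketch; the two corpora here are toy placeholders:

```python
from collections import Counter
import math

def token_distribution(messages):
    """Relative token frequencies over a corpus of messages."""
    counts = Counter(tok for m in messages for tok in m.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def kl_divergence(p, q, floor=1e-9):
    """D(p || q) in bits; tokens missing from q are floored, not dropped."""
    return sum(pv * math.log2(pv / q.get(tok, floor)) for tok, pv in p.items())

train = token_distribution(["refund my order", "my order is late"])
prod  = token_distribution(["ou est mon remboursement", "commande en retard"])
print(kl_divergence(prod, train))  # large value -> distributions misaligned
```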
Internationalization
Multilingual AI should usually be fully internationalized. Training on one dialect and then relying on what was learned with a distinct dialect will almost always perform poorly. That creates a data acquisition challenge.
As stated above, classification and learning depend on the alignment of statistical distributions between data used in training and data processing relying on the use of what was learned. This is also true of human learning, so this requirement will not likely be overcome any time soon.
All these forms of foul language must be programmed flexibly across these cultural dimensions.
- Character set
- Collation order
- Language
- Dialect
- Other locale related determinants
- Education level
- Economic strata
Once one of these is included in the model (which will be imperative), there is no reason why the others cannot be included at little cost, so it is wise to begin with standard dimensions of flexibility. The alternative will likely lead to costly branching complexity to represent specific rules, which could have been made more maintainable by generalizing for international use up front.
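One way to keep those dimensions general from the start is to carry them as a structured locale record attached to every message, rather than as branches in code. A hedged sketch; the field names are invented:

```python
from dataclasses import dataclass

# Invented field names; the point is that locale context travels with the
# message as data, so new dimensions are added without branching logic.
@dataclass(frozen=True)
class LocaleContext:
    charset: str = "UTF-8"
    collation: str = "und"   # Unicode default collation
    language: str = "en"
    dialect: str = ""        # e.g. "en-GB" vs "en-US"
    region: str = ""

@dataclass
class Message:
    text: str
    locale: LocaleContext

msg = Message("Where is my refund?",
              LocaleContext(language="en", dialect="en-US"))
```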
Insult Recognition
Insults require comprehension beyond the current state of technology. Cognitive science may change that in the future, but projections are mere conjecture.
Use of a regular expression engine with a fuzzy logic comparator is achievable and may appease the stakeholders of the project, but identifying insults may be infeasible at this time, and the expectations should be set with stakeholders to avoid later surprises. Consider these examples.
- The nose on your face looks like a camel.
- Kiss the darkest part of my little white. (From Avatar screenplay)
The word combinations in these are not likely to be in some data set you can use for training, so Word2Vec will not help in these types of cases. Additional layers may assist with proper handling of at least some of the semantic and referential complexity of insults, but only some.
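A sketch of the regular-expression-plus-fuzzy-comparator idea mentioned above, using the standard library's difflib; the phrase list is an assumption, and a real system would load per-locale phrase lists from configuration:

```python
import difflib

# Assumed phrase list; it would come from per-locale configuration.
KNOWN_INSULT_PHRASES = ["your face looks like a camel"]

def fuzzy_insult_score(message: str) -> float:
    """Best similarity between the message and any known insult phrase."""
    msg = message.lower()
    return max(difflib.SequenceMatcher(None, msg, phrase).ratio()
               for phrase in KNOWN_INSULT_PHRASES)

print(fuzzy_insult_score("The nose on your face looks like a camel."))  # high
print(fuzzy_insult_score("Please reset my password."))                  # low
```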
Explicit Answers to Explicit Questions
Is it possible to accomplish this task with a neural network?
Yes, in combination with excellence in higher level system design and best practices for internationalization.
Is the structure of this neural network correct for this task?
The initial experiments look like a reasonable beginning toward what would later be correct enough. Do not be discouraged, but don't expect the first pass at something like this to look much like what passes user acceptance testing a year from now. Even experts can't pull off that rate of R&D progress, unless they hack and cobble something together from previous work.
Are 300k messages enough to train the neural network?
Probably not. In fact, 300m messages will not catch all combinations of cultural references, analogies, colloquialisms, variations in dialect, plays on words, and games that spammers play to avoid detection.
What would really help is a feedback mechanism so that production outcomes are driving the training rather than a necessarily limited data set. Canned data sets are usually restricted in the accuracy of their probabilistic representation of social phenomena. None will likely infer dialect and other locale features to better detect insults. A Parisian insult may have nothing in common with a Creole insult.
The feedback mechanism must be based on impressions in some way to become and remain accurate. The impressions must be labelled with all the locale data that is reasonably easy to collect and possibly correlated to the impression.
This implies the use of rules acquisition, fuzzy logic control, reinforcement learning, or the application of naive Bayesian approaches somewhere appropriate within the system architecture.
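As one concrete possibility among those options, a naive Bayesian classifier can be updated incrementally as production feedback arrives; scikit-learn's MultinomialNB supports this through partial_fit. The feature extraction and labels below are placeholders:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# HashingVectorizer needs no fitted vocabulary, so new feedback can stream in;
# alternate_sign=False keeps features non-negative, as MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = MultinomialNB()

# Initial batch (placeholder data), then an incremental feedback update.
X = vectorizer.transform(["win a free prize now", "please update my invoice"])
clf.partial_fit(X, [1, 0], classes=[0, 1])   # 1 = moderated, 0 = operative

feedback = vectorizer.transform(["claim your free prize"])
clf.partial_fit(feedback, [1])               # production outcome drives training

print(clf.predict(vectorizer.transform(["free prize inside"])))  # likely [1]
```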
Do I need to clean up the data from uppercase, special characters, numbers etc?
Numbers can be relevant. Because of historical events and religious texts, 13 and 666 might be indications of something offensive, respectively. One can also use numbers and punctuation to convey word content. Here are some examples of spam detection resistant click bait.
- I've got a 6ex opportunity 4u.
- Wanna 69?
- Values are rising 50%! We have 9 investment choices 4 you to check out.
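Rather than stripping numbers, one option is to generate a normalized variant of each message alongside the raw text, so digit substitutions like those above become visible to the classifier. The substitution table is an illustrative assumption and would be locale specific in practice:

```python
import re

# Illustrative digit/letter substitution table; real tables would be
# maintained per language, since substitutions are locale specific.
SUBSTITUTIONS = {"4u": "for you", "6ex": "sex", "4": "for", "2": "to"}

def normalized_variant(message: str) -> str:
    """Expand common digit substitutions; the raw text is kept alongside."""
    tokens = re.findall(r"[\w']+", message.lower())
    return " ".join(SUBSTITUTIONS.get(tok, tok) for tok in tokens)

print(normalized_variant("I've got a 6ex opportunity 4u."))
# -> "i've got a sex opportunity for you"
```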
The meaning of the term special character is vague and ambiguous. Any character in UTF-8 is legitimate for almost all Internet communications today. HTML5 provides additional entities beginning with an ampersand and ending with a semicolon. (See https://dev.w3.org/html5/html-author/charref.)
Filtering these out is a mistake. Spammers leverage these standards to penetrate spam detection. For example, the stroke similarity between a capital ell (L) and the British pound symbol (£) can be exploited to produce spam detection resistant click bait.
Removing special characters that fit within the Internet standards of UTF-8 and HTML entities will likely lead to disaster. It is recommended not to follow that part of the predecessor's design.
Regarding emoticons and other ideograms, these are linguistic elements that may represent in text encoding the volume, pitch, or tone modulation of phonetics, or they may represent facial expressions or body language. In many languages ideograms are used in place of words. For a global system running in parallel with the blogosphere, emoticons are part of linguistic expression.
For that reason, they are not significantly different than word roots, prefixes, suffixes, conjugations, or word pairs as linguistic elements which can also express emotion as well as logical reasoning. For the learning algorithm to learn categorization behavior in the presence of ideograms, the ideograms must remain in training features and later in real time processing of those features using the results of training.
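A sketch of tokenization that keeps ideograms as first-class tokens instead of stripping them; the Unicode ranges cover only the main emoji blocks and are an assumption, not an exhaustive list:

```python
import re

# Word characters, or single characters from the main emoji/symbol blocks.
# The Unicode ranges here are a partial, illustrative selection.
TOKEN_RE = re.compile(r"[\w']+|[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def tokenize(message: str) -> list:
    """Tokenize text while keeping emoticons/ideograms as their own tokens."""
    return TOKEN_RE.findall(message.lower())

print(tokenize("Great job 🙂 thanks!!"))  # ['great', 'job', '🙂', 'thanks']
```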
Additional Information
Some additional information is covered in this existing post: Spam Detection using Recurrent Neural Networks.
Since spam detection is closely related to fraud detection, the spammer fraudulently acting as if a relationship already exists with the recipient, this existing post may be of assistance too: Can we implement GAN (Generative adversarial neural networks) for classication problem like Fraud detecion?
Another resource that may help is this: https://www.tensorflow.org/tutorials/representation/word2vec