Spam Detection using Recurrent Neural Networks

Question

I am working on this code for spam detection using recurrent neural networks.

Question 1. I am wondering whether this field (using RNNs for email spam detection) is worth further research or whether it is a closed research field.

Question 2. What is the oldest published paper in this field?

Question 3. What are the pros and cons of using RNNs for email spam detection over other classification methods?

I don't think any type of linguistic pattern detection is a statistics problem. It is a semantic pattern recognition problem that relies on probability and statistics as part, but only part, of the mathematical, scientific, linguistic, and psychological tool set used to approach and design a solution to the problem. — Douglas Daseeco, Oct 15 '18 at 12:15

Seth Simba · Answer 1 · 2018-01-15T19:49:50.157

Initially, spam detection relied on simple rule-based techniques to sort out spam. However, following Paul Graham's famed article 'A Plan for Spam', the Naive Bayes approach became so popular that it came to be regarded as the baseline for dealing with spam.

Following breakthroughs in deep learning, researchers have now turned their focus to neural networks to help them deal with the perennial problem of spam emails. Google reported in 2015 that introducing neural networks to Gmail's spam filters took them from 99.5% to over 99.9% accuracy, which means the error rate fell from about 0.5% to under 0.1%. This suggests that neural networks, especially when used in conjunction with Bayesian classification, may be effective for enhancing spam filters. You can read about Google's success story at the link below: https://www.wired.com/2015/07/google-says-ai-catches-99-9-percent-gmail-spam/

Developing a spam filter using neural networks is basically a classification problem. You need to follow the steps below to develop such a system. (Nikhil B 2016)

Collect a dataset of spam and legitimate email messages and label each message. You can find email and spam datasets here: http://csmining.org/index.php/spam-email-datasets-.html
Process these messages with feature extraction and vectorising techniques, e.g. a tf-idf vectorizer, word2vec, bag-of-words, etc.
Once you have vectorised the dataset successfully, choose a supervised neural network model, e.g. a radial basis function network or a multi-layer perceptron (MLP) trained with backpropagation.
Train the neural network on your labelled dataset. Once training is complete, you can use cross-validation, or a held-out test dataset, to estimate the precision of your trained model.
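The four steps above can be sketched end to end in a few lines. This is only a minimal illustration under stated assumptions, not the method from the cited source: the six-message corpus is made up, the function names (`fit_idf`, `tfidf`, `train`, `is_spam`) are hypothetical, and a single logistic unit trained by gradient descent stands in for the MLP or RBF network so that the example needs nothing beyond the Python standard library.

```python
import math
from collections import Counter

# Step 1: a toy labelled corpus (hypothetical data). 1 = spam, 0 = legitimate.
TRAIN = [
    ("win free money now", 1),
    ("free prize claim now", 1),
    ("cheap pills free offer", 1),
    ("meeting agenda for monday", 0),
    ("lunch with the project team", 0),
    ("please review the attached report", 0),
]

def tokenize(text):
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for every word seen in training."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(tokenize(doc)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}

def tfidf(text, idf):
    """Step 2: map a message to a sparse, L2-normalised tf-idf vector."""
    counts = Counter(w for w in tokenize(text) if w in idf)
    vec = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {w: v / norm for w, v in vec.items()}

def predict_proba(vec, weights, bias):
    z = bias + sum(weights.get(w, 0.0) * v for w, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, idf, epochs=200, lr=0.5):
    """Steps 3 and 4: fit one logistic unit with stochastic gradient descent."""
    weights, bias = {}, 0.0
    vectors = [(tfidf(text, idf), label) for text, label in samples]
    for _ in range(epochs):
        for vec, label in vectors:
            err = predict_proba(vec, weights, bias) - label
            for w, v in vec.items():
                weights[w] = weights.get(w, 0.0) - lr * err * v
            bias -= lr * err
    return weights, bias

idf = fit_idf([text for text, _ in TRAIN])
weights, bias = train(TRAIN, idf)

def is_spam(text):
    return predict_proba(tfidf(text, idf), weights, bias) > 0.5
```

Calling `is_spam("free money offer")` and `is_spam("agenda for the team meeting")` exercises the trained model on unseen word combinations. A real system would use a much larger dataset, a proper train/test split, and a network with hidden layers.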

Some of the advantages of using NNs for spam detection over other methods include:

Neural networks achieve higher accuracy in identifying spam, as demonstrated by Google.
They have a lower false positive rate than other methods such as rule-based techniques.
Their main disadvantage is that they may require specialised computing hardware to deploy.
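Accuracy, precision, and false positive rate are different quantities, and for spam filtering the false positive rate (legitimate mail flagged as spam) often matters most. The following is a small sketch of how they relate, using hypothetical helper names and made-up labels where 1 means spam and 0 means legitimate.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating spam (1) as positive."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def spam_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Of the messages flagged as spam, how many really were spam?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the real spam messages, how many were caught?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Of the legitimate messages, how many were wrongly flagged?
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Made-up example: 3 spam and 5 legitimate messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = spam_metrics(y_true, y_pred)
# accuracy = 6/8, precision = 2/3, recall = 2/3, false_positive_rate = 1/5
```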

Some older, influential papers published in the field include:

Machine Learning Techniques in Spam Filtering (2004) http://ats.cs.ut.ee/u/kt/hw/spam/spam.pdf

Detecting Spam Blogs: A Machine Learning Approach (2006) https://www.aaai.org/Papers/AAAI/2006/AAAI06-212.pdf

A review of machine learning approaches to Spam filtering (2009) https://www.sciencedirect.com/science/article/pii/S095741740900181X

Douglas Daseeco · Answer 2 · 2019-01-01T21:23:10.047

Question 1. I am wondering whether this field (using RNNs for email spam detection) is worth further research or whether it is a closed research field.

Use of RNNs to detect spam grew out of the use of artificial networks to detect fraud in telecommunications and the financial industry, which was a response to the rise of attacks on long distance lines, ATMs, banks, and credit card systems, both online and at the data centers supporting physical points of sale.

Although basic RNN design has given way to the newer LSTM and GRU approaches and their variants and extensions, artificial networks are now one of the primary fraud detection technologies. The dominance of this fraud detection strategy extends to spam detection, which has close ties to fraud: spammers present the appearance of a relationship with their recipients that does not exist.

The improvement of computing designs that recognize patterns in time series data, and the application of those designs to fraud detection and countermeasures and to the detection and routing or deletion of unwanted incoming information, will be a stable area of research and development for the foreseeable future.

Question 2. What is the oldest published paper in this field?

There is no single oldest published paper. The first papers on RNNs are given in this answer: Where can I find the original paper that introduced RNNs?, but the move from pattern-based detection to artificial networks to stateful artificial networks was gradual. The earliest deployments of these networks in server-side or client-side solutions occurred before any papers were published on the specific topic of RNN use in spam detection.

Question 3. What are the pros and cons of using RNNs for email spam detection over other classification methods?

Spam also has a strong temporal element. What one considers undesirable spam in one year may be considered mission critical email a few years later, and vice versa. Performance in this space includes speed, accuracy, and reliability of classification, but also adaptation to changing user classification needs.

It is because of these four performance characteristics in tandem that stateful networks derived from RNNs are commonly used for spam detection. The need for gated learning and forgetting at the cell level to support the variable adaptivity makes the LSTM and GRU variants common choices.
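To make "gated learning and forgetting at the cell level" concrete, here is a minimal sketch of one GRU step with a single input feature and a single hidden unit. The parameter names and the hand-picked weights are assumptions for illustration only; a real spam model would use vector states, word embeddings as inputs, and trained weights. Note that papers and libraries differ on which side of the update gate keeps the old state; the convention used here is stated in the code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One step of a scalar GRU cell (one input feature, one hidden unit)."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Convention used here: z near 0 keeps the old state (remembering),
    # z near 1 overwrites it with the candidate (forgetting).
    return (1.0 - z) * h_prev + z * h_cand

def run(sequence, p, h0=0.0):
    """Feed a sequence through the cell and return the final hidden state."""
    h = h0
    for x in sequence:
        h = gru_step(x, h, p)
    return h

# Hand-picked, untrained parameters (hypothetical values).
base = {"wz": 0.0, "uz": 0.0, "bz": 0.0,
        "wr": 0.0, "ur": 0.0, "br": 0.0,
        "wh": 1.0, "uh": 0.0, "bh": 0.0}

remember = dict(base, bz=-20.0)  # update gate almost closed
forget = dict(base, bz=20.0)     # update gate almost fully open

sequence = [1.0, -1.0, 0.5]
kept = run(sequence, remember, h0=0.8)    # stays very close to the initial 0.8
replaced = run(sequence, forget, h0=0.8)  # very close to tanh(0.5), the last input
```

In a trained network the gate values depend on the input and the previous state, so the cell can hold on to some evidence across a long message while discarding the rest, which is the adaptivity the paragraph above refers to.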

Semantic document classification is riding on an emerging set of technologies, which are primarily artificial network designs that begin to broach the threshold of cognitive understanding of the text by storing linguistic structure in forms that allow analogy, comparison, and composition between them. Semantic algorithms that perform these operations on fuzzy associations in combination with recursive artificial networks may emerge as the dominant design as such designs are further developed.

References

Detecting Spam Blogs: A Machine Learning Approach, Pranam Kolari, 2006

Automated labeling of bugs and tickets using attention-based mechanisms in recurrent neural networks, Volodymyr Lyubinets et al., 2018

Spam Filter Through Deep Learning and Information Retrieval, Weicheng Zhang, 2018

An Unsupervised Neural Network Approach to Profiling the Behavior of Mobile Phone Users for Use in Fraud Detection, Peter Burge, John Shawe-Taylor, Journal of Parallel and Distributed Computing, Volume 61 Issue 7, July 2001, pp 915-925

Intelligent junk mail detection using neural networks, Michael Vinther, June 2002

Mining for fraud, Margaret Weatherford, IEEE Intelligent Systems, 2002

Discovering golden nuggets: data mining in financial application, D Zhang, L Zhou, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 34, No. 4, November 2004

A Comprehensive Survey of Data Mining-based Fraud Detection Research, C Phua, V Lee, K Smith, R Gayler, arXiv preprint, 2007