Find anomalies from records of categorical data

Question

I have a dataset with $m$ observations and $p$ nominal categorical variables $X_1, X_2, \dots, X_p$, each of which can take several different values.

Ultimately, I am looking for a way to find anomalies, i.e. to identify rows whose combination of values seems inconsistent with the data I have seen so far.

So far, I have been thinking about building a model to predict the value of each column, and then building a metric that measures how different the actual row is from the predicted row.

I would greatly appreciate any help!

score 2 · Answer 1 · answered Jun 28 '18 at 17:43

First of all, you mention that your data are categorical. For nominal variables, I don't see how you can define a similarity measure, and therefore a distance between the predicted value and the ground truth (the error). You can do that only if the data are ordinal.

If you want to just classify between normal and anomalous points (binary classification), without caring about further classification of the anomaly types themselves, one of the most common algorithms is the One-Class Support Vector Machine (OC-SVM).

Anomalies are unpredictable in nature and sometimes hard to replicate and record. Therefore, there is usually a lack of anomalous data, and supervised learning approaches suffer: if you sacrifice some "precious" anomalous points to train the algorithm, you cannot use them to test it.

The main advantage of OC-SVM is that it is a semi-supervised method: you train it only on normal data, and during testing it detects samples that deviate from the learned behaviour and classifies them as anomalous. Thus, you "save" all the rare anomalous points for testing purposes!

Take a look at this short Python example; it has all you need :)
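Separately from that linked example, here is a minimal sketch of the train-on-normal-only workflow described above. It assumes scikit-learn, and it one-hot encodes the nominal columns so the SVM can consume them (the encoding step is my assumption, not something stated above). The toy records, the column meanings, and the `nu` and `gamma` values are all made up for illustration and would need tuning on real data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import OneClassSVM

# Hypothetical "normal" records (colour, shape, size): one typical profile,
# plus records in which exactly one field deviates from it.
typical = ("red", "circle", "small")
one_off = [
    ("green", "circle", "small"), ("blue", "circle", "small"),
    ("red", "square", "small"), ("red", "triangle", "small"),
    ("red", "circle", "medium"), ("red", "circle", "large"),
]
X_train = np.array([typical] * 140 + [row for row in one_off for _ in range(10)])

# One-hot encode the nominal columns; categories never seen during fit
# are encoded as all zeros for that column instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_enc = encoder.fit_transform(X_train).toarray()

# Fit on normal data only. nu bounds the fraction of training points treated
# as outliers; nu and gamma are illustrative values, not recommendations.
model = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05)
model.fit(X_train_enc)

X_test = np.array([
    ("red", "circle", "small"),    # the typical profile
    ("blue", "triangle", "large"), # every field deviates at once
])
X_test_enc = encoder.transform(X_test).toarray()

pred = model.predict(X_test_enc)              # +1 = normal, -1 = anomalous
scores = model.decision_function(X_test_enc)  # lower = more anomalous

for row, p, s in zip(X_test, pred, scores):
    print(tuple(row), p, round(float(s), 3))
```

In this toy setup the typical row lands inside the learned region, and the row where all three fields deviate gets a clearly negative score. Records with a single deviating field sit close to the decision boundary, so in practice it is often more useful to rank rows by `decision_function` than to rely on the hard +1/-1 labels.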
