How to source training data in ML for information security?

Question

A company entrusts a Data Scientist with the mission of processing and extracting value from data for the research or treatment of events related to traces of computer attacks. I was wondering how he would get the training data.

I guess he would need to exploit the logs of the different devices of the clients and use statistical, Machine Learning and visualization techniques in order to bring a better understanding of the attacks in progress and to identify the weak signals of attacks... But how would he get labelled data?

He might get the logs of attacks received before, but those might not share the same signature as the attacks that come later. So might it be difficult to create a reliable product?

score 2 · Accepted Answer · answered Apr 18 '21 at 09:49

By asking for labels, you have implicitly assumed that supervised learning is being used. That assumption leads to the following potential problems:

Log file data tends to be huge, so labelling it may be infeasible given the time and expertise required;
Then there's the class imbalance problem: attack examples are far rarer than "normal" behaviour, and this can distort supervised models during both the learning and evaluation stages;
And finally, even if the data is labelled, a supervised model is unlikely to be useful in detecting completely novel attacks because it has not been trained to recognise them.
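
To make the class imbalance point concrete, here is a minimal sketch in plain Python. The event counts are hypothetical (10 attacks among 10,000 log events), and the "model" is a trivial one that always predicts "normal". It is only meant to show why accuracy alone is a misleading metric on this kind of data.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), where label 1 = attack and 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of real attacks that were detected."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical data: 10 attacks hidden among 10,000 log events.
y_true = [1] * 10 + [0] * 9990
# A useless "model" that labels every event as normal.
y_pred = [0] * 10000

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy = {acc:.3f}, attack recall = {rec:.3f}")
```

The accuracy looks excellent even though not a single attack is caught, which is why recall and precision on the attack class are usually more informative here.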

I think a far easier way to approach this kind of problem would be unsupervised learning: model normal patterns and behaviours in the logs, and then flag any deviations from normality. A flagged deviation may be an attack or it may be new normal behaviour. In the latter case, the model can be updated. There are various approaches here, such as clustering, outlier detection and possibly self-supervised learning, that might be useful. Dimensionality reduction techniques might also be useful to visualise clusters of "normal" behaviour that can be compared to abnormal patterns.
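
As a rough illustration of "model normal, flag deviations", here is a minimal outlier detection sketch using only the Python standard library. It assumes a single hypothetical feature (failed logins per hour for one host) and uses a robust z-score based on the median and median absolute deviation; the feature, the numbers and the 3.5 threshold are all assumptions for the example, and a real system would likely use many features and a dedicated outlier detection method.

```python
import statistics


def fit_baseline(values):
    """Summarise 'normal' behaviour by its median and median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med, mad


def anomaly_score(value, med, mad):
    """Robust z-score: distance from the median in MAD-based units."""
    if mad == 0:
        return 0.0 if value == med else float("inf")
    # 1.4826 makes the MAD comparable to a standard deviation for Gaussian data.
    return abs(value - med) / (1.4826 * mad)


# Hypothetical history of failed logins per hour during normal operation.
normal_history = [3, 5, 4, 6, 5, 4, 3, 5, 6, 4]
med, mad = fit_baseline(normal_history)

THRESHOLD = 3.5  # assumed cut-off; would need tuning on real data
new_observations = [4, 6, 40, 5]
flags = [anomaly_score(x, med, mad) > THRESHOLD for x in new_observations]
print(flags)

# If an analyst decides a flagged value is new normal behaviour,
# the baseline can be refitted with it included.
updated_med, updated_mad = fit_baseline(normal_history + [40])
```

Only the burst of 40 failed logins is flagged. No attack labels were needed to fit the baseline, only a history that is believed to be mostly normal.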

We do a lot of work in this space. You can absolutely do a great job using supervised learning. Evaluation of reconstruction loss is just one example of how this can be accomplished using limited training data or possibly even only using examples of "good" activity. It's all about feature selection. — David Hoelzer, Apr 19 '21 at 23:55
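
For readers unfamiliar with reconstruction loss, here is a minimal sketch of the idea, fitted only on examples of "good" activity. It is not the commenter's method: instead of a neural autoencoder it uses a one-component linear projection (similar in spirit to PCA) in plain Python, and the two features and all values are hypothetical. The model learns the dominant pattern in the good data, and points that it reconstructs poorly are treated as suspicious.

```python
import math


def fit_linear_model(rows):
    """Learn the mean and dominant direction of 'good' 2-D feature vectors."""
    n = len(rows)
    mean = [sum(r[i] for r in rows) / n for i in range(2)]
    centered = [[r[0] - mean[0], r[1] - mean[1]] for r in rows]
    # Entries of the 2x2 covariance matrix.
    sxx = sum(c[0] * c[0] for c in centered) / n
    sxy = sum(c[0] * c[1] for c in centered) / n
    syy = sum(c[1] * c[1] for c in centered) / n
    # Power iteration approximates the first principal direction.
    v = [1.0, 1.0]
    for _ in range(100):
        w = [sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    return mean, v


def reconstruction_error(row, mean, v):
    """Squared distance between a point and its projection onto the learned direction."""
    c = [row[0] - mean[0], row[1] - mean[1]]
    proj = c[0] * v[0] + c[1] * v[1]
    recon = [proj * v[0], proj * v[1]]
    return (c[0] - recon[0]) ** 2 + (c[1] - recon[1]) ** 2


# Hypothetical "good" activity: the second feature is roughly twice the first.
good = [(1, 2.1), (2, 3.9), (3, 6.0), (4, 8.1), (5, 9.9)]
mean, direction = fit_linear_model(good)

good_errors = [reconstruction_error(r, mean, direction) for r in good]
suspicious_error = reconstruction_error((3, 1.0), mean, direction)
print(max(good_errors), suspicious_error)
```

The suspicious point breaks the relationship seen in the good data, so its reconstruction error is orders of magnitude larger than that of any training example. How well this works in practice depends heavily on which features are chosen.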
