How to detect and deal with data distribution drift/change?

Question

I'm working on an ML problem to assess the performance of multiple vendors. I have a set of features in my dataset, and it appears that each vendor is characterized by its own distribution. This is my hypothesis, because I see my target variable shift and change value ranges over time as data from different vendors is acquired.

Is there a way, in a real application system, to implement a mechanism that detects such data drift/change (whether in the target variable or in the feature distributions)?

If so, how can I deal with it? Should I be constantly re-training a model to deal with the new data? What are the common practices for dealing with newly observed data?

Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. — Community, Jul 23 '22 at 08:24

score 1 · Answer 1 · answered Jul 23 '22 at 12:40

Happy to give a general approach, but we would need more context to pinpoint the actual problem.

In general, you should try to understand the data before applying any model. If you are using neural networks, the shape of the data distribution should not be an issue in itself, since they do not assume normally distributed inputs. You should first check for:

Outliers
Correlation between variables
Scale of the features
Size of the different populations (you mentioned you are analysing multiple vendors; do you have the same amount of data for each?)
Missing values
Data corruption
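
Several of the checks above can be run per vendor in a few lines. The following is a minimal sketch, assuming the data is in a pandas DataFrame; the column names (`vendor`, `price`), the helper `vendor_data_report`, and the toy values are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import pandas as pd


def vendor_data_report(df, vendor_col, feature_cols):
    """Summarise sample size, missing values, outliers and scale per vendor."""
    rows = []
    for vendor, group in df.groupby(vendor_col):
        for col in feature_cols:
            values = group[col]
            # Flag outliers with the 1.5 * IQR rule (an assumption, not the only option).
            q1, q3 = values.quantile(0.25), values.quantile(0.75)
            iqr = q3 - q1
            is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
            rows.append({
                "vendor": vendor,
                "feature": col,
                "n_rows": len(values),                  # size of this population
                "n_missing": int(values.isna().sum()),  # missing values
                "n_outliers": int(is_outlier.sum()),    # outliers
                "mean": values.mean(),                  # scale / location
                "median": values.median(),
                "std": values.std(),
            })
    return pd.DataFrame(rows)


# Hypothetical toy data: vendor A has one extreme value, vendor B has a missing one.
df = pd.DataFrame({
    "vendor": ["A"] * 5 + ["B"] * 3,
    "price": [10.0, 11.0, 9.0, 10.5, 100.0, 50.0, None, 52.0],
})

report = vendor_data_report(df, "vendor", ["price"])
print(report)
```

With more than one numeric feature, `df[feature_cols].corr()` would cover the correlation check as well.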

The different distribution shapes might indicate an issue with the data, e.g. some vendors might not record the data accurately or as often as others. If you think that there are no underlying issues, then you could try:

Use a measure of central tendency to differentiate the vendors (the median if the distribution is heavily skewed, the mean if it is roughly normal)
Use ANOVA to test whether the vendors' means differ significantly
Scale your features so that each falls within the same range, or centre their distributions
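
As a rough illustration of the ANOVA and scaling suggestions, here is a minimal sketch using `scipy.stats.f_oneway`. The three vendors and their values are synthetic and purely hypothetical, and the 0.05 threshold is a conventional choice, not something taken from the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical feature values for three vendors; vendor C is shifted upwards.
vendors = {
    "A": rng.normal(loc=10.0, scale=2.0, size=200),
    "B": rng.normal(loc=10.0, scale=2.0, size=200),
    "C": rng.normal(loc=14.0, scale=2.0, size=200),
}

# One-way ANOVA: the null hypothesis is that all vendor means are equal.
f_stat, p_value = stats.f_oneway(*vendors.values())
means_differ = p_value < 0.05  # conventional significance level (assumption)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}, means differ: {means_differ}")


def standardize(x):
    """Centre to zero mean and scale to unit standard deviation."""
    return (x - x.mean()) / x.std()


# Standardising per vendor removes vendor-specific location and scale.
scaled = {name: standardize(values) for name, values in vendors.items()}
```

ANOVA assumes roughly normal groups with similar variances; for heavily skewed data, a rank-based test such as `scipy.stats.kruskal` may be a better fit.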

A good model must also be tested for its generalization performance (i.e. how well it responds to new data points). This has more to do with not overfitting the model during training.
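
One way to probe generalization in a multi-vendor setting is to compare the error on held-out data from vendors seen in training with the error on a vendor the model has never seen. The sketch below assumes a simple linear model fitted with NumPy least squares; the data generator `make_vendor` and all of its parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_vendor(n, slope, shift):
    """Hypothetical vendor: feature centred at `shift`, target = slope * x + noise."""
    x = rng.normal(loc=shift, scale=1.0, size=n)
    y = slope * x + rng.normal(loc=0.0, scale=0.1, size=n)
    return x, y


x_train, y_train = make_vendor(300, slope=2.0, shift=0.0)  # vendors seen in training
x_same, y_same = make_vendor(100, slope=2.0, shift=0.0)    # new data, same behaviour
x_new, y_new = make_vendor(100, slope=3.0, shift=5.0)      # unseen, different vendor

# Fit y = a * x + b by ordinary least squares.
design = np.column_stack([x_train, np.ones_like(x_train)])
coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)


def rmse(x, y):
    """Root mean squared error of the fitted line on (x, y)."""
    pred = coef[0] * x + coef[1]
    return float(np.sqrt(np.mean((y - pred) ** 2)))


rmse_same = rmse(x_same, y_same)
rmse_new = rmse(x_new, y_new)
print(f"RMSE, same vendors: {rmse_same:.3f}; RMSE, unseen vendor: {rmse_new:.3f}")
```

A large gap between the two errors suggests that the new vendor's data differs from the training data, which is a sign that re-training or vendor-specific handling may be needed.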
