How to handle class imbalance when the actual data are that way

Question

My supervised learning training data come from real-world data, and in real cases one class occurs much less often than the others: only around 5% of all cases.

To be precise, the first two classes make up 95% of the training data and the last one makes up 5%. When I train while keeping this ratio intact, accuracy reaches 50% at the very first step and then jumps to 90%+ almost immediately, which doesn't make sense.

Should I exclude some of the data from classes 1 and 2 so that all three classes have the same number of samples? That would no longer be the real-world ratio, though.

score 2 · Accepted Answer · edited Dec 11 '21 at 08:50

You can use stratified cross-validation combined with an imbalanced learning technique applied to the training data. Stratification ensures that when you split your data into train and test, the ratio of frequencies between the classes will stay the same, and therefore the test data will always be "realistic".
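To make the idea concrete, here is a minimal sketch of a stratified train/test split in plain Python. The function name `stratified_split`, the 20% test fraction, and the 500/450/50 class counts are made up for illustration (the question only says that two classes cover 95% and the third 5%). In practice a library routine such as scikit-learn's `StratifiedKFold` would do this for you across several folds.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each class's share the same in both."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Take the same fraction of every class for the test set.
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: classes 1 and 2 cover 95%, class 3 covers 5%.
labels = [1] * 500 + [2] * 450 + [3] * 50
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", sorted(train_counts.items()))
print("test: ", sorted(test_counts.items()))
```

Class 3 stays at 5% of both the training and the test part, so the test part remains "realistic" no matter what you later do to the training part.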

However, when training a model (using only the training data, of course), the imbalance may have a negative impact. Therefore, have a look at some of the imbalanced learning techniques that are out there to remedy this situation. For example, you could try these:

random undersampling: discard random examples from the majority classes until the ratios of class frequencies are close to 1
random oversampling: make random duplicates of minority class examples until the ratios of class frequencies are close to 1
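SMOTE: like random oversampling, except that synthetic examples are created (by interpolating between a minority class example and one of its nearest minority class neighbours) instead of random duplicates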
balanced bagging: performs random undersampling but does so multiple times to create an ensemble of models trained on balanced subsets of the training data

etc.
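The first two techniques in the list can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper names `random_undersample` and `random_oversample` and the toy 50/45/5 training set are hypothetical. Ready-made versions of all four techniques are available in the imbalanced-learn package (`RandomUnderSampler`, `RandomOverSampler`, `SMOTE`, `BalancedBaggingClassifier`), which is probably what you would use in practice.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop examples until every class is as small as the rarest one."""
    rng = random.Random(seed)
    target = min(Counter(y).values())
    keep = []
    for cls in set(y):
        pool = [i for i, label in enumerate(y) if label == cls]
        keep.extend(rng.sample(pool, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, seed=0):
    """Randomly duplicate examples until every class is as large as the biggest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        pool = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - n):
            i = rng.choice(pool)  # sample with replacement
            X_out.append(X[i])
            y_out.append(cls)
    return X_out, y_out


# Hypothetical TRAINING data only; the test data is left untouched.
X_train = [[float(i)] for i in range(100)]
y_train = [1] * 50 + [2] * 45 + [3] * 5

X_under, y_under = random_undersample(X_train, y_train)
X_over, y_over = random_oversample(X_train, y_train)
print("undersampled:", sorted(Counter(y_under).items()))
print("oversampled: ", sorted(Counter(y_over).items()))
```

Either way, the resampling is applied after the split and only to the training part, so the evaluation still reflects the real-world class ratio.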

You should also be careful about the metrics you use to assess predictive performance on the test data. Accuracy can be misleading here: a model that never predicts the 5% class can still score up to 95%. Metrics like sensitivity and specificity (calculated for each class individually) are likely to be more informative.
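Here is a small sketch of how per-class sensitivity and specificity can be computed one-vs-rest, again in plain Python. The function name and the toy predictions are made up for illustration: the hypothetical model gets classes 1 and 2 right but labels every class 3 example as class 1.

```python
def per_class_sensitivity_specificity(y_true, y_pred):
    """Return {class: (sensitivity, specificity)}, treating each class one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    results = {}
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        tn = sum(1 for t, p in pairs if t != cls and p != cls)
        sensitivity = tp / (tp + fn)  # share of this class that was found
        # Share of the other classes correctly not assigned to this class.
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        results[cls] = (sensitivity, specificity)
    return results


# Hypothetical test labels with the realistic 50/45/5 ratio.
y_true = [1] * 50 + [2] * 45 + [3] * 5
# Hypothetical model that never predicts the rare class 3.
y_pred = [1] * 50 + [2] * 45 + [1] * 5

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_sensitivity_specificity(y_true, y_pred)
print("accuracy:", accuracy)
for cls, (sens, spec) in metrics.items():
    print(f"class {cls}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Accuracy looks excellent here, while the sensitivity for class 3 is zero, which is exactly the kind of failure that accuracy alone hides.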
