
We were given a list of around 100 known positive cases, i.e. people who have a certain disease, so all of these people carry the same label (disease). We also have access to a much larger pool of patients that we can label as negative cases (everyone who is not on the known-positives list).

I know who the positives are, but how do I select negative cases to create a labeled dataset of both positives and negatives, on which to train and then test a neural network?

This is a common problem in the medical field, where doctors have lists of patients who are positive; in our case, however, we were not given a specific list of negative cases.

I argued for sampling a number of negatives that reflects the true prevalence of the disease (around 1-2% positives). However, I was told that this isn't necessary and that a 50:50 split of positives to negatives would do. My concern is that a model trained this way will not generalize beyond our train and test datasets.

What would you do in this case?

Otto
  • I worry that changing the class imbalance could give a false improvement in accuracy. – Otto Dec 20 '20 at 14:57
  • It will, but accuracy is a bad choice of metric for such an imbalanced dataset. For medical diagnoses you will also care a lot about the false positive rate (it is common to get more false positives than true positives with such imbalances, which can lead to a lot of unwanted stress and unnecessary treatments). Hopefully someone can answer with more detail. – Neil Slater Dec 20 '20 at 15:24
  • Thanks for your thoughts. Do you think an F1 would be better? – Otto Dec 20 '20 at 18:25
  • I think someone who knows more about the context here should answer. I have edited the salient detail into the title, and this is a common enough problem that there is a tag for it, too. – Neil Slater Dec 20 '20 at 20:07

1 Answer


Short answer

To decide how to construct the dataset, you should first choose a metric for comparison and then pick the construction that gives the better value of that metric. There is no single best metric; the right choice depends on the task and on which type of error you consider more important.

If you believe errors should not be normalized across classes, then use overall accuracy and keep your dataset distribution the same as the natural distribution (so 1-2% positive cases).

If you believe errors should be normalized across classes, then use PR-AUC or ROC-AUC and re-balance your dataset so that the class ratio is closer to 1:1. The exact ratio can only be determined by testing and comparing the PR-AUC or ROC-AUC values.
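To make the two options concrete, here is a minimal sketch in Python; the patient IDs, registry size, and the 1.5% prevalence figure are placeholders rather than values from your data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder inputs: ~100 known positive patient IDs and the full patient registry.
positive_ids = np.arange(100)
all_ids = np.arange(100_000)
negative_pool = np.setdiff1d(all_ids, positive_ids)

n_pos = len(positive_ids)

# Option A: keep the natural distribution (e.g. ~1.5% positives) by sampling
# enough negatives to dilute the positives accordingly.
n_neg_natural = int(n_pos / 0.015) - n_pos
negatives_natural = rng.choice(negative_pool, size=n_neg_natural, replace=False)

# Option B: re-balance towards 1:1 by sampling as many negatives as positives.
negatives_balanced = rng.choice(negative_pool, size=n_pos, replace=False)
```

Whichever construction you go with, the two can then be compared by evaluating the metric you settle on below.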

How to select the best metric?

Two popular metrics are ROC-AUC and PR-AUC. ROC (Receiver Operating Characteristic) curves plot the true positive rate against the false positive rate, while PR (Precision-Recall) curves plot precision against recall. AUC stands for "area under the curve": any single point on a curve corresponds to one choice of classifier threshold, so aggregating over all points (i.e. taking the entire area under the curve) is the most general way of comparing whether one model is doing better than another.
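For reference, both areas can be computed directly from a classifier's predicted scores with scikit-learn. The labels and scores below are simulated placeholders; `average_precision_score` is the usual approximation of the area under the PR curve:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Simulated test set with ~1.5% positives and an imperfect classifier's scores.
y_true = (rng.random(10_000) < 0.015).astype(int)
y_score = np.clip(0.4 * y_true + rng.random(10_000), 0.0, 1.0)

print("ROC-AUC:", roc_auc_score(y_true, y_score))            # area under the ROC curve
print("PR-AUC :", average_precision_score(y_true, y_score))  # average precision, ~ area under the PR curve
```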

Although both ROC curves and PR curves account for class imbalance to some extent, PR curves are more sensitive to it. The paper The Relationship Between Precision-Recall and ROC Curves concludes that if the PR-AUC is good then the ROC-AUC will also be good, but not the other way around. The difference comes from the fact that, under heavy class imbalance, a false positive hurts the PR curve significantly more than the ROC curve.
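To see why, consider an illustrative test set with 1% prevalence: 100 positives and 9,900 negatives. A classifier operating at a true positive rate of 0.9 and a false positive rate of 0.1 produces 90 true positives and 990 false positives. The ROC point (0.1, 0.9) looks excellent, but the precision at that same threshold is only 90 / (90 + 990) ≈ 0.08, which drags the PR curve down.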

On the other hand, total accuracy does not normalize for class imbalance at all, and therefore favors the majority class.
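For example, at 2% prevalence, a model that simply predicts "negative" for every patient already achieves 98% accuracy while detecting none of the positive cases.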

As a result:

  • if you do not care about normalizing your metric across the class imbalance, choose total accuracy, which optimizes for the greatest number of correct predictions (regardless of class)
  • if you want to normalize your metric across class imbalance, and normalizing false positive errors across classes is at all important to you, choose PR-AUC
  • if you want to normalize your metric across class imbalance, and don't care about normalizing false positive errors, PR-AUC or ROC-AUC may both be good for you

If it helps, for most imbalance problems, people usually go for PR curves.

By the way, this paper studies class imbalance in neural networks by optimizing for the ROC curve, and shows that you should definitely have equal numbers of positive and negative examples. So if you want the best performance in terms of ROC-AUC, you should do the 50:50 split. I haven't read a similar study that optimizes for PR-AUC, but my intuition is that the conclusion will be the same (a 50:50 split is also what optimizes PR-AUC).
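If it helps to see the comparison end to end, here is a rough sketch under the assumptions above: re-balance only the training set, keep the held-out test set at the natural prevalence, and compare the metrics. The features and labels are random placeholders, so the printed numbers are meaningless; only the procedure is illustrated:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Placeholder features and labels (~1.5% positives); substitute your real patient data.
X = rng.normal(size=(20_000, 10))
y = (rng.random(20_000) < 0.015).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Re-balance only the training set to 1:1 by undersampling negatives.
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = rng.choice(np.flatnonzero(y_train == 0), size=len(pos_idx), replace=False)
balanced_idx = np.concatenate([pos_idx, neg_idx])

for name, idx in [("natural", np.arange(len(y_train))), ("50:50  ", balanced_idx)]:
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    clf.fit(X_train[idx], y_train[idx])
    scores = clf.predict_proba(X_test)[:, 1]
    print(name,
          "ROC-AUC:", round(roc_auc_score(y_test, scores), 3),
          "PR-AUC:", round(average_precision_score(y_test, scores), 3))
```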

user3667125