How do you measure multi-label classification accuracy?

Question

Multi-label assignment is the machine learning task of assigning to each input a set of categories from a fixed vocabulary. The categories need not be statistically independent, which precludes simply building a set of independent classifiers, each deciding whether an input belongs to one category or not.

Machine learning also needs a measure by which a model may be evaluated. So the question is: how do we evaluate a multi-label classifier?

We can’t use the usual recall, accuracy and F measures, since they require a binary correct-or-not judgement of each categorisation. Without such a measure we have no obvious means to evaluate models or to measure concept drift.

I like this discussion of accuracy: https://stats.stackexchange.com/q/312780/247274. Two examples of the strictly proper scoring rules that Kolassa discusses are cross-entropy loss (log loss) and the Brier score. – Dave, Oct 28 '20 at 20:04

score 0 · Answer 1 · answered Oct 29 '20 at 03:50

Your intuition is correct. We do use other metrics for multi-label classification, because the meaning of evaluation itself changes. Apart from grading the classifier on whether a prediction is entirely correct or not, we also have to penalize it appropriately when it gets only some of the labels wrong. You could use the following metrics:

Micro- and macro-averaged Recall, Precision, etc.
Hamming Loss
Subset Accuracy
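As a rough illustration of how these three kinds of metric differ, here is a minimal pure-Python sketch. It assumes labels are encoded as 0/1 indicator rows (one column per category); the function name `multilabel_metrics` and the toy data are made up for the example, and F1 is used as a stand-in for the micro/macro-averaged family (precision and recall are averaged in the same two ways).

```python
def multilabel_metrics(y_true, y_pred):
    """Compute a few multi-label metrics from 0/1 indicator rows.

    y_true, y_pred: lists of equal-length lists, one row per sample,
    one column per label (hypothetical encoding chosen for this sketch).
    """
    n_samples = len(y_true)
    n_labels = len(y_true[0])

    # Hamming loss: fraction of individual label decisions that are wrong.
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    hamming = wrong / (n_samples * n_labels)

    # Subset accuracy: fraction of samples whose full label set is exactly right.
    exact = sum(row_t == row_p for row_t, row_p in zip(y_true, y_pred))
    subset_acc = exact / n_samples

    # Per-label true positives, false positives and false negatives.
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    for row_t, row_p in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(row_t, row_p)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all labels, then compute one score.
    micro_f1 = f1(sum(tp), sum(fp), sum(fn))
    # Macro: compute one score per label, then take the unweighted mean.
    macro_f1 = sum(f1(a, b, c) for a, b, c in zip(tp, fp, fn)) / n_labels

    return {
        "hamming_loss": hamming,
        "subset_accuracy": subset_acc,
        "micro_f1": micro_f1,
        "macro_f1": macro_f1,
    }


# Toy data: 4 samples, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
metrics = multilabel_metrics(y_true, y_pred)
print(metrics)
```

On this toy data only one of the four samples is predicted exactly, so subset accuracy is low, while Hamming loss gives credit for the nine of twelve individual label decisions that are right. Micro averaging is dominated by frequent labels, whereas macro averaging weights every label equally.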

Saurav Maheshkar


Maybe you should briefly describe how they are calculated (or what's the intuition behind them) or, at least, provide a link to a research paper (or reliable article) that explains these metrics/measures. – nbro Oct 29 '20 at 11:39

score 0 · Answer 2 · answered Mar 29 '21 at 16:08

Even with a binary classifier, one number does not fully represent the behaviour: the confusion matrix has three degrees of freedom. This is even more true of a multi-class problem, where it is best to print out the whole confusion matrix. Then you can pick up problems like "large class A is well classified, many Bs are wrongly classified as C, and the few Ds are wrongly assigned to A, B, or C".

Even better, printing the confusion matrix helps you think about what the real business goals are: which of these errors matters most in practice?
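To make that concrete, here is a small pure-Python sketch of building and printing a multi-class confusion matrix. The helper names `confusion_matrix` and `print_confusion_matrix` and the toy labels are invented for this example. For a genuinely multi-label problem, one option (an assumption here, not something stated in this answer) is to build a separate 2x2 table for each label in the same way.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def print_confusion_matrix(matrix, classes):
    """Print the matrix with a header row and one labelled row per true class."""
    print("true\\pred " + " ".join(f"{c:>4}" for c in classes))
    for cls, row in zip(classes, matrix):
        print(f"{cls:>9} " + " ".join(f"{n:>4}" for n in row))


# Toy data: A is large and well classified, Bs leak into C, the lone D goes to A.
classes = ["A", "B", "C", "D"]
y_true = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
y_pred = ["A", "A", "A", "A", "C", "C", "B", "C", "C", "A"]

cm = confusion_matrix(y_true, y_pred, classes)
print_confusion_matrix(cm, classes)
```

Reading along a row shows where the members of one true class ended up, which is what exposes patterns such as "Bs are mostly predicted as C" that a single accuracy figure hides.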
