
I am reading the paper "Evaluating Scalable Bayesian Deep Learning Methods for Robust Computer Vision", and I do not understand the definition of the AUSE metric in this sentence: "but only in terms of the AUSE metric which is a relative measure of the uncertainty estimation quality." There is no explanation of this term in the paper, and googling it didn't bring up anything useful either. Can anyone tell me what AUSE stands for? Thank you so much!

1 Answer


For the AUSE metric, the paper you read refers to the paper "Uncertainty Estimates and Multi-Hypotheses Networks for Optical Flow". AUSE stands for Area Under the Sparsification Error curve. There are two core concepts here, the Sparsification Plot and the Sparsification Error (defined in Section 5.2 of that paper).

Concept 1: Sparsification Plot

Here, we plot our metric of interest (e.g. the mean prediction error) on the y-axis and the percentage of removed samples on the x-axis. The idea is to remove samples (hence the term sparsification) in descending order of uncertainty and to evaluate the metric of interest after each removal, as sketched in the code below. Your model's sparsification plot is then compared to an Oracle's sparsification plot. The Oracle removes samples in descending order of the true error of your model's predictions. The closer your model's curve is to the Oracle's curve, the better your model is in terms of uncertainty-error correspondence (or uncertainty calibration).
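To make this concrete, here is a minimal NumPy sketch (my own illustration, not code from either paper) where the metric of interest is the mean error of the remaining samples; the function name and the removal grid are my own choices:

```python
import numpy as np

def sparsification_curve(errors, uncertainties, n_steps=20):
    """Mean error of the samples that remain after removing the most
    uncertain fraction f of the data, for f = 0 up to ~0.95.

    errors, uncertainties: 1-D NumPy arrays of per-sample error and
    predicted uncertainty (hypothetical helper, names are my own).
    """
    order = np.argsort(-uncertainties)       # most uncertain samples first
    sorted_errors = errors[order]
    fractions = np.linspace(0.0, 0.95, n_steps)
    remaining_error = []
    for f in fractions:
        kept = sorted_errors[int(f * len(errors)):]  # drop the top-f most uncertain
        remaining_error.append(kept.mean())
    return fractions, np.asarray(remaining_error)

# The Oracle ranks by the true per-sample error instead of the uncertainty,
# so its curve is obtained by passing the errors as the ranking signal:
# fractions, oracle_curve = sparsification_curve(errors, errors)
```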

Concept 2: Sparsification Error

But how do you compare your model's performance with other models, given that they all have different errors and hence different sparsification plots (and different Oracles)? The authors propose to simply take the difference between a model's sparsification plot and its Oracle's (hence the term error) and compute the area under that curve (AUC). Unlike most cases where a high AUC is better, here a lower AUC is considered better.
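Reusing the sparsification_curve helper sketched above, AUSE could be approximated like this (again my own sketch, with a simple trapezoidal integration over the removal fractions):

```python
def ause(errors, uncertainties, n_steps=20):
    """Area Under the Sparsification Error curve (lower is better)."""
    fracs, model_curve = sparsification_curve(errors, uncertainties, n_steps)
    _, oracle_curve = sparsification_curve(errors, errors, n_steps)
    gap = model_curve - oracle_curve   # sparsification error at each fraction
    # trapezoidal rule for the area under the gap
    return float(np.sum(np.diff(fracs) * (gap[:-1] + gap[1:]) / 2.0))
```

As a sanity check, uncertainties that correlate well with the true errors should give an AUSE close to zero, while an uninformative uncertainty signal should give a clearly larger value.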

You can stop reading here if you just need an answer to the OP's question. The rest of the content is some of my thoughts on this.


My thoughts: Global Implication

For future readers of this post: the authors of the paper referenced by the OP refer to the AUSE metric as "relative" and instead propose their own "absolute" metric called AUCE (Area Under the Calibration Error curve). They argue that AUSE can be gamed if the model's uncertainty is consistently underestimated. My understanding of this is that if your model assigns similar (or identical?) uncertainty to each sample, then the removal process used to compute the sparsification plot behaves just like the Oracle's, and the Sparsification Error is then essentially zero. In spite of this theoretical edge case, I have seen other papers use the same concept as the sparsification plot, though termed a "Risk-Coverage" curve instead (e.g. "Automatic segmentation with detection of local segmentation failures in cardiac MRI").

My thoughts: Uncertainty calibration vs Uncertainty-Error Correspondence

As a side note, most folks use the term uncertainty calibration, though in my opinion it should be termed "uncertainty-error correspondence". Strictly speaking, the former runs counter to the definition of calibration for a predictive model. Historically, calibration is spoken of in the context of probabilities of a particular event ("The Well-Calibrated Bayesian", Journal of the American Statistical Association, 1982). Since uncertainty values (e.g. entropy, variance, etc.) are not directly indicative of probabilities, even though they are derived from them, I believe it is semantically incorrect to say "uncertainty calibration".