Deep neural networks sometimes rely on shortcut features (spurious correlations) to make predictions.
For example, in cat-dog classification, the network may use background information (e.g., floor or grass) as a shortcut. In lung cancer classification, the network may use features outside the lung region (see Fig. 2).
Such behavior makes the predictions unreliable (and hard to explain to humans). Hence, under the name of shortcut learning, researchers study various techniques for avoiding shortcut features.
Here, my question is: why not just use a detection model (e.g., Fast R-CNN) to first detect the object of interest (e.g., the cat/dog, or the tumor) in the image, and then predict the label from the cropped image (with the background masked out)?
This naive method must have some problems; otherwise, there would be no need to develop shortcut-learning/robust-learning methods at all.
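To make the naive pipeline concrete, here is a minimal sketch of its masking step (the function name, the dummy image, and the hard-coded box are my own illustration; in a real system the box would come from a detector such as Fast R-CNN, and the masked image would be fed to the classifier):

```python
import numpy as np

def mask_outside_box(image, box):
    """Zero out all pixels outside the detected bounding box.

    image: H x W x C array; box: (x1, y1, x2, y2) in pixel coordinates.
    """
    x1, y1, x2, y2 = box
    masked = np.zeros_like(image)
    masked[y1:y2, x1:x2] = image[y1:y2, x1:x2]
    return masked

# Toy example: an 8x8 "image" and a pretend detector output.
img = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
masked = mask_outside_box(img, (2, 2, 6, 6))
assert masked[0, 0].sum() == 0              # background removed
assert (masked[3, 3] == img[3, 3]).all()    # object region preserved
```

Note that even this sketch already hints at one issue: the quality of the final prediction now depends entirely on the box being correct.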
From this post, I have found some possible answers:
training two networks (detection and classification) is costly in time;
the detection model may produce incorrect bounding boxes.
However, these answers feel like engineering concerns. None of them points out an intrinsic problem with the naive method.
Any suggestions, reference papers, etc. are welcome.