Deep neural networks sometimes rely on shortcut features (spurious correlations) to make predictions.
For example, in cat-dog classification, the network may use background information (e.g., floor or grass) as a shortcut. In lung cancer classification, the network may use features outside the lung region (see Fig. 2).
Such behavior makes the predictions unreliable (and hard to explain to humans). Hence, under the name of shortcut learning, researchers study various techniques for avoiding shortcut features.
Here, my question is: why not just use a detection model (e.g., Fast R-CNN) to first detect the object of interest (e.g., the cat/dog, or the tumor) in the image, and then predict the label from the cropped image (with the background masked out)?
This naive method must have some problems; otherwise, there would be no need to develop shortcut-learning/robust-learning methods at all.
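To make the naive pipeline concrete, here is a minimal sketch of its masking step (the function name, the dummy image, and the hard-coded box are my own illustration; in a real system the box would come from a detector such as Fast R-CNN, and the masked image would be fed to the classifier):

```python
import numpy as np

def mask_outside_box(image, box):
    """Zero out all pixels outside the detected bounding box.

    image: H x W x C array; box: (x1, y1, x2, y2) in pixel coordinates.
    """
    x1, y1, x2, y2 = box
    masked = np.zeros_like(image)
    masked[y1:y2, x1:x2] = image[y1:y2, x1:x2]
    return masked

# Toy example: an 8x8 "image" and a pretend detector output.
img = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
masked = mask_outside_box(img, (2, 2, 6, 6))
assert masked[0, 0].sum() == 0              # background removed
assert (masked[3, 3] == img[3, 3]).all()    # object region preserved
```

Note that even this sketch already hints at one issue: the quality of the final prediction now depends entirely on the box being correct.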
From this post, I have found some possible answers:
training two networks (detection and classification) is costly in time;
the detection model may produce incorrect bounding boxes.
However, these answers feel like engineering concerns. None of them points out an intrinsic problem with the naive method.
Any suggestions, reference papers, etc. are welcome.