
The ImageNet dataset is an established benchmark for measuring the performance of computer vision (CV) models.

ImageNet contains 1000 categories, and the goal of a classification model is to output the correct label for a given image.

Researchers compete to improve the state of the art (SOTA) on this dataset; the current SOTA is 90.88% top-1 accuracy.
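For reference, top-1 accuracy simply checks whether the model's single highest-scoring class matches the one ground-truth label. A minimal sketch in PyTorch (random logits stand in for a real model's output):

```python
import torch

# Minimal sketch of the top-1 metric behind the 90.88% figure.
logits = torch.randn(8, 1000)           # batch of 8 images, 1000 ImageNet classes
labels = torch.randint(0, 1000, (8,))   # one ground-truth class per image

preds = logits.argmax(dim=1)            # single highest-scoring class per image
top1 = (preds == labels).float().mean().item()
print(f"top-1 accuracy: {top1:.2%}")
```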

If each image contained only a single object against a background, this problem would be well-posed (at least from our perceptual point of view). However, many images in the dataset contain multiple objects - a group of people, or a person and an animal - so the task of classification becomes ambiguous.

Here are some examples.

The true class for this image is "bicycle". However, the image also shows a group of people. A model that recognizes these people would be right from the human point of view, but its prediction would be marked wrong.

[Image: ImageNet training photo labeled "bicycle" that also shows a group of people]

Another example is a fisherman holding a fish called a tench. A model could recognize the person instead of the fish and be marked wrong.

[Image: ImageNet training photo labeled "tench" showing a fisherman holding the fish]

So, my question is: how much does the performance of the best models on ImageNet reflect their ability to capture a complicated and diverse image distribution, and how much of the final result on the validation set is accidental? When multiple objects are present in an image, the network can predict any of them, and that prediction may or may not match the ground-truth class. A model that happens to be luckier will score better on this benchmark, while the actual quality of the two models can, in fact, be the same.
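To make this "luck" effect concrete, here is a toy simulation (hypothetical numbers, not a real ImageNet experiment): every image contains two equally valid objects, the annotator arbitrarily recorded one of them as the ground truth, and two models that are perceptually identical - each always names a valid object - still end up with different measured top-1 accuracies:

```python
import random

random.seed(0)

# Each image contains two equally valid objects; the annotator arbitrarily
# picked one as ground truth. Both models always name *a* valid object,
# i.e. they are perceptually identical, yet their measured scores differ.
n_images = 10_000
hits_a = hits_b = 0
for _ in range(n_images):
    objects = random.sample(range(1000), 2)    # two valid classes in the image
    truth = random.choice(objects)             # the single recorded label
    hits_a += random.choice(objects) == truth  # model A picks a valid object
    hits_b += random.choice(objects) == truth  # model B picks a valid object

print(f"model A top-1: {hits_a / n_images:.1%}")  # about 50%, plus noise
print(f"model B top-1: {hits_b / n_images:.1%}")  # about 50%, different noise
```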

  • Are these images really from ImageNet? I can't believe that images like these are in a dataset for image classification. Maybe you can provide a reference to the images, maybe a piece of code that loads ImageNet and displays them. I understand your concerns but the question "how much does the prediction of the model conform with one of the multiple labels" is a bit unclear, although I understand where you want to go with this question. Maybe you can reformulate it in a clearer way. – nbro Dec 23 '21 at 23:51
  • @nbro These are from ImageNet. The upper image was randomly sampled from the `train` dataset; the image with the fish is the first image in the `train` dataset (a loading sketch is shown after these comments). Note that ImageNet is available only by request; one cannot simply download it (and not just due to its large size). – spiridon_the_sun_rotator Dec 24 '21 at 04:49
  • Ok. I understand the problem(s) that you bring up: we're trying to force the model to say to which class the object in the image belongs, while, in many images, there is clearly more than one object. However, I still think this question is not fully clear. Are you trying to understand how much the model is confident about its predictions when there are multiple objects? So, let's say that a model achieves 100%. Then let's say that it predicts that in the first image the class is "bicycle", but there's clearly also a human there too. – nbro Dec 24 '21 at 10:54
  • Maybe another model does not achieve 100% accuracy and says that in the first image there's a human rather than a bicycle, assuming that both of these classes are present in the dataset. Now, your question seems to be related to how the model learns the representations of bicycles and humans (in this case) and how far apart they are. But, again, I still don't understand the type of answer you're looking for. – nbro Dec 24 '21 at 10:54
  • So, I see multiple potential questions here that are related to 1. the confidence of the model to predict exactly one class (when there are many), 2. the representations that the model learns when there are many objects in the image. Your last sentence "The actual quality can be the same, in fact." suggests that you think that if a model predicts there's a human or bicycle in the first image, the quality would be the same. Well, yes, in a way, this is true. – nbro Dec 24 '21 at 10:57
  • However, if the class that we assign to the image is "bicycle" and a model predicts "human", that's not what we wanted to hear from the model. So, it seems that the problem is also related to the labelling of the image and whether it makes sense to perform image classification (i.e. classifying the image as belonging to only 1 class) when the images may contain more than one object. So, I see one big problem, but many potential questions, and it's not clear to me the type of answer you're looking for, as I said. – nbro Dec 24 '21 at 11:03
  • @nbro - actually, I ask one question: is ImageNet a reliable estimate of how good a given model is? We could give the task of fitting random labels, and that would also be a benchmark of performance. If there were a single object, the task would be unambiguous and valid. In the case of multiple objects, sensible tasks would be segmentation or detection, but not classification. – spiridon_the_sun_rotator Dec 25 '21 at 08:11
  • Yes, but the problem is that you're saying "how good the given model is", but according to which metric? If we pass one or more images of an elephant to a model and say it's a cat, then the model will probably learn that an elephant is a cat. Not conceptually, but just based on the visual patterns of the image. – nbro Dec 25 '21 at 08:13
  • @nbro so, it is difficult to have a reliable measure of a computer vision model's quality. Probably, the ability to do fast few-shot learning on multiple datasets is a more reliable estimate of the quality of the learned representation. – spiridon_the_sun_rotator Dec 25 '21 at 08:19
  • So, your question seems to be "Are the representations learned by models trained on ImageNet to solve image classification reliable enough, given that multiple images might have multiple objects?". If that's your question, which seems to be from your comment, then I would suggest that you formulate it in this way, then I will upvote, because I think this post is very interesting (but was a bit unclear, to me at least). Your answer, however, seems to go in another direction, i.e. that you can improve the accuracy by changing the task. So, that's why I thought your post wasn't clear enough. – nbro Dec 25 '21 at 08:26
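As referenced in the comments above, here is a sketch of how one might load ImageNet with torchvision and display the first training image (the tench). This assumes the ILSVRC2012 archives have already been obtained by request and placed under `root`; the path is a placeholder:

```python
import matplotlib.pyplot as plt
from torchvision.datasets import ImageNet

# Path is a placeholder; torchvision expects the official archives there.
dataset = ImageNet(root="/path/to/imagenet", split="train")

image, label = dataset[0]               # class 0 is "tench, Tinca tinca"
plt.imshow(image)
plt.title(", ".join(dataset.classes[label]))
plt.axis("off")
plt.show()
```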

1 Answer


This question is studied in a recent paper - Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels.

It is quite strange that the community has paid so little attention to this issue.

The authors relabeled the original ImageNet with multi-labels localized to image regions and changed the task to multi-label classification, so that each random training crop is supervised by the labels of the objects it actually contains.

[Image: figure from the paper illustrating the re-labeling strategy]

This strategy turned out to be quite beneficial and improved the accuracy of ResNet-50 on the validation set. At the time, the reported 80.2% top-1 accuracy was the best result for a ResNet-50.
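As a rough illustration of the idea (a simplified sketch, not the paper's actual LabelPooling implementation): each training crop is supervised with a soft label distribution pooled from a dense label map over the cropped region, instead of the single image-level hard label. The map shape and crop coordinates below are made up; soft-target `cross_entropy` requires PyTorch >= 1.10:

```python
import torch
import torch.nn.functional as F

# Hypothetical dense label map: a class-score distribution at each of
# 15x15 spatial locations (the paper generates such maps with a strong classifier).
label_map = torch.rand(1000, 15, 15).softmax(dim=0)

# Region of the label map covered by a random training crop (made-up coordinates).
y1, y2, x1, x2 = 3, 10, 5, 12
soft_target = label_map[:, y1:y2, x1:x2].mean(dim=(1, 2))  # crop-level soft labels

# Soft-target cross-entropy (supported since PyTorch 1.10).
logits = torch.randn(1, 1000)           # model output for the crop
loss = F.cross_entropy(logits, soft_target.unsqueeze(0))
print(loss.item())
```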