Suppose we have $1000$ products that we want to detect. For each of these products, we have $500$ training images with annotations, so we have $500{,}000$ annotated training images in total. If we want to train a good object detector (e.g. YOLO) to recognize these objects, would it be better to use multiple detection models? In other words, should we have 10 different YOLO models, each responsible for detecting 100 products, or is it good enough to have one YOLO model that can detect all 1000 products? Which would be better in terms of mAP, recall, and precision?

- If you have 10 YOLO models that each detect 100 different products and you need to detect a product, how will you know which model was trained for that specific product? You would have to wastefully run the image through the other 9 YOLO models as well. In that case you might as well build an ensemble of object detectors instead of training separate detectors on different objects. As for a single YOLO model, the authors of the YOLOv2 paper claim that it can detect over 9000 object categories, so it is not unreasonable to expect that YOLOv3 could learn 1000 different objects on its own. – Brale Nov 24 '19 at 10:58
- Added another correct answer @Prime Number – Clement Nov 29 '19 at 01:06
1 Answer
This is called decomposition of a multi-class classifier. Your proposed method is called one-vs.-all.
One vs. all provides a way to leverage binary classification. Given a classification problem with $N$ possible solutions, a one-vs.-all solution consists of $N$ separate binary classifiers—one binary classifier for each possible outcome. During training, the model runs through a sequence of binary classifiers, training each to answer a separate classification question.
Source: https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/one-vs-all.
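To make the quoted definition concrete, here is a minimal sketch of one-vs.-all in plain Python. It is an illustration under assumptions, not how YOLO works internally: the binary classifier is a simple perceptron, the data are toy 2D points, and the names `train_binary_perceptron`, `train_one_vs_all`, and `predict` are made up for this example.

```python
def train_binary_perceptron(points, targets, max_epochs=1000):
    """Train one binary classifier; targets are +1 (this class) or -1 (rest)."""
    w = [0.0, 0.0, 0.0]  # weights for x, y, and a bias term
    for _ in range(max_epochs):
        mistakes = 0
        for (x, y), t in zip(points, targets):
            score = w[0] * x + w[1] * y + w[2]
            if t * score <= 0:  # misclassified (or on the boundary): update
                w[0] += t * x
                w[1] += t * y
                w[2] += t
                mistakes += 1
        if mistakes == 0:  # a clean pass over the data: stop early
            break
    return w


def train_one_vs_all(points, labels):
    """Train N separate binary classifiers, one per class."""
    classes = sorted(set(labels))
    return {
        c: train_binary_perceptron(points, [1 if l == c else -1 for l in labels])
        for c in classes
    }


def predict(models, point):
    """Pick the class whose binary classifier gives the highest score."""
    x, y = point
    return max(models, key=lambda c: models[c][0] * x + models[c][1] * y + models[c][2])


# Toy data: three well-separated clusters standing in for three classes.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 0), (11, 0), (10, 1), (11, 1),
          (0, 10), (1, 10), (0, 11), (1, 11)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

models = train_one_vs_all(points, labels)
predictions = [predict(models, p) for p in points]
```

Note that the number of trained models grows with the number of classes, which is the cost discussed in the rest of this answer.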
According to this article, the author ran experiments with SVMs on 8 different benchmark problems. The results show that this method is sometimes as good as the others, but usually not the best, and it is never substantially better than any other method. The article also states that the best method is usually problem dependent.
Also, this method will slow down inference considerably and use a substantial amount of GPU memory, since every model has to be run on each image. According to the source, it does not improve performance much, so your best bet for higher accuracy is probably a different model architecture, for example FPN FRCN (Faster R-CNN with a feature pyramid network), which the YOLOv3 paper reports as more accurate than YOLOv3 but slower at inference. YOLOv3 is designed for fast inference, to provide a real-time object detection system, so if accuracy matters more than speed you should probably use another architecture instead.
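The inference cost of the multi-model setup can be sketched as follows. This is a rough illustration, assuming each of the 10 models is a callable that returns `(local_class_id, score, box)` tuples; the `detect_with_shards` and `make_stub_detector` names and the stub detectors are hypothetical and do not correspond to any real YOLO API.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[int, float, Box]       # (class_id, score, box)
Detector = Callable[[object], List[Detection]]

CLASSES_PER_MODEL = 100  # from the question: 10 models x 100 products each


def detect_with_shards(detectors: List[Detector], image: object):
    """Run every detector on the image and merge results into global class ids."""
    merged: List[Detection] = []
    forward_passes = 0
    for shard_index, detector in enumerate(detectors):
        forward_passes += 1  # one full forward pass per model, per image
        for local_id, score, box in detector(image):
            # Map the model-local class id (0..99) to a global id (0..999).
            global_id = shard_index * CLASSES_PER_MODEL + local_id
            merged.append((global_id, score, box))
    # Highest-confidence detections first.
    merged.sort(key=lambda d: d[1], reverse=True)
    return merged, forward_passes


def make_stub_detector(local_id: int, score: float) -> Detector:
    """Hypothetical stand-in for a trained model: always returns one detection."""
    def detector(image: object) -> List[Detection]:
        return [(local_id, score, (0.0, 0.0, 1.0, 1.0))]
    return detector


detectors = [make_stub_detector(local_id=7, score=0.1 * (i + 1)) for i in range(10)]
detections, passes = detect_with_shards(detectors, image=None)
```

With 10 models, every image costs 10 forward passes, whereas a single 1000-class model needs one. A real pipeline would also need a cross-model non-maximum suppression step, which is omitted here.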