I'm working on a project with a limited dataset of videos (about 200). We want to train a model that detects a single class in the videos. The class can take many different shapes (a thin wire, a large region of the screen, etc.).
There are three options for how we can label this data (sketched in code below):
- Image classification (the class appears somewhere in the image)
- Bounding box (the class is within this area)
- Semantic segmentation (these pixels belong to the class)
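For concreteness, here's a minimal sketch of what a single frame's label would look like under each option. The names, shapes, and box coordinates are illustrative, not taken from any particular annotation tool:

```python
import numpy as np

frame_height, frame_width = 720, 1280

# 1. Image classification: one binary flag per frame.
classification_label = 1  # the class is present somewhere in the frame

# 2. Bounding box: (x_min, y_min, x_max, y_max) per instance, in pixels.
bounding_box_label = [(412, 96, 988, 654)]

# 3. Semantic segmentation: a binary mask with one value per pixel.
segmentation_label = np.zeros((frame_height, frame_width), dtype=np.uint8)
segmentation_label[96:654, 412:988] = 1  # mark the pixels belonging to the class
```

Each option carries strictly more spatial information than the one before it, which is what my question below is about.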
My assumption is that a model trained on semantic segmentation labels would perform slightly better than one trained on bounding boxes, and far better than one trained only on image-level classification labels. Is that correct?