> I am currently using MSE to predict the center coordinates of the ROI, as well as its width and height. All values are relative to the image size. I think that such an approach does not put enough pressure on the fact that those coordinates are related.
At first glance, this looks quite reasonable. Computer vision is not really my main area of expertise, so I did some googling around, and one of the first repositories I ran into does something very similar. It may be interesting for you to look into the code and the references in that repository in more detail.
It looks to me like they're also using the MSE loss function. I'm not 100% sure how they define the bounding boxes; maybe you can figure that out by digging through the code. You currently define a bounding box by the following four values (a quick sketch of the corresponding loss follows the list):
- X coordinate of center of bounding box
- Y coordinate of center of bounding box
- Width of bounding box
- Height of bounding box
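For reference, the loss you describe would look roughly like this (a minimal sketch, assuming PyTorch and that predictions and targets are `(N, 4)` tensors of `(cx, cy, w, h)` values normalized to `[0, 1]`; the names are mine, not taken from your code):

```python
import torch
import torch.nn.functional as F

def center_size_mse(pred, target):
    """MSE over (cx, cy, w, h) boxes, all values relative to image size.

    pred, target: (N, 4) tensors of [center_x, center_y, width, height].
    The four coordinates are treated as independent regression targets,
    which is exactly the coupling issue discussed here.
    """
    return F.mse_loss(pred, target)

# Example: center 0.05 too far to the right, width 0.1 too large.
pred = torch.tensor([[0.55, 0.50, 0.30, 0.20]])
target = torch.tensor([[0.50, 0.50, 0.20, 0.20]])
print(center_size_mse(pred, target))  # tensor(0.0031)
```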
You are right that these coordinates are quite closely related. If the center is incorrect (for example, a bit too far to the right), that mistake can partially be "fixed" by predicting a greater width (the bounding box would extend a bit too far to the right, but still encapsulate the object). I don't know whether this is necessarily a problem, a fact that should be exploited in some way, or something that should be "put pressure on". If this is something you are concerned about, I suppose you could alternatively define the bounding box as follows (I'm not sure whether or not this is what's done in the repository linked above; a conversion sketch follows the list):
- X coordinate of top-left corner of bounding box
- Y coordinate of top-left corner of bounding box
- X coordinate of bottom-right corner of bounding box
- Y coordinate of bottom-right corner of bounding box
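If you wanted to experiment with this, one option is to keep your network's `(cx, cy, w, h)` output and only convert at loss time. A minimal sketch under the same assumptions as above (my own illustration; I don't know whether the linked repository does anything like this):

```python
import torch
import torch.nn.functional as F

def centers_to_corners(boxes):
    """Convert (cx, cy, w, h) boxes to (x1, y1, x2, y2) corner boxes."""
    cx, cy, w, h = boxes.unbind(dim=-1)
    return torch.stack([cx - w / 2, cy - h / 2,
                        cx + w / 2, cy + h / 2], dim=-1)

def corner_mse(pred, target):
    """MSE over the two corner points instead of center + size."""
    return F.mse_loss(centers_to_corners(pred), centers_to_corners(target))
```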
Intuitively, I suspect the relation between those two corner points will be weaker than the relation you identified between the center, width, and height. A "mistake" in the coordinates of the top-left corner cannot be partially "fixed" by placing the bottom-right corner somewhere else. The toy example below illustrates this.
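To make that concrete (these numbers are my own, purely illustrative): suppose the ground-truth box spans [0.4, 0.6] horizontally, and the prediction's center is 0.05 too far to the right while its width overshoots by 0.1, so the left edge still lands in the right place:

```python
# Toy example: the same pair of boxes in both parameterizations.
# Ground truth: center 0.5,  width 0.2 -> spans [0.4, 0.6]
# Prediction:   center 0.55, width 0.3 -> spans [0.4, 0.7]
target_cx, target_w = 0.50, 0.20
pred_cx, pred_w = 0.55, 0.30

# Center + size view: two coordinates are wrong, and the errors
# partially cancel in terms of where the box edges end up.
print(round(pred_cx - target_cx, 6))  # 0.05 center error
print(round(pred_w - target_w, 6))    # 0.1 width error

# Corner view: the same two boxes, expressed as edges.
target_x1, target_x2 = target_cx - target_w / 2, target_cx + target_w / 2
pred_x1, pred_x2 = pred_cx - pred_w / 2, pred_cx + pred_w / 2
print(round(pred_x1 - target_x1, 6))  # 0.0 -> left edge is exactly right
print(round(pred_x2 - target_x2, 6))  # 0.1 -> all error on the right edge
```

In the center + size parameterization, MSE penalizes two coordinates even though one edge is perfectly placed; in the corner parameterization, the penalty lands exactly on the one edge that is actually wrong.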