> I am currently using MSE to predict the center coordinates of the ROI, as well as its width and height. All values are relative to the image size. I think that such an approach does not put enough pressure on the fact that those coordinates are related.
At first glance, this looks quite reasonable. Computer vision is not really my main area of expertise, so I did some googling around, and one of the first repositories I ran into does something very similar. It may be interesting for you to look into the code and the references in that repository in more detail.
It looks to me like they're also using the MSE loss function. I'm not 100% sure how they define the bounding boxes; maybe you can figure that out by digging through the code. You currently define a bounding box by the following four values (a quick sketch of the corresponding loss follows the list):
- X coordinate of center of bounding box
- Y coordinate of center of bounding box
- Width of bounding box
- Height of bounding box
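For reference, the loss you describe would look roughly like this (a minimal sketch, assuming PyTorch and that predictions and targets are `(N, 4)` tensors of `(cx, cy, w, h)` values normalized to `[0, 1]`; the names are mine, not taken from your code):

```python
import torch
import torch.nn.functional as F

def center_size_mse(pred, target):
    """MSE over (cx, cy, w, h) boxes, all values relative to image size.

    pred, target: (N, 4) tensors of [center_x, center_y, width, height].
    The four coordinates are treated as independent regression targets,
    which is exactly the coupling issue discussed here.
    """
    return F.mse_loss(pred, target)

# Example: center 0.05 too far to the right, width 0.1 too large.
pred = torch.tensor([[0.55, 0.50, 0.30, 0.20]])
target = torch.tensor([[0.50, 0.50, 0.20, 0.20]])
print(center_size_mse(pred, target))  # tensor(0.0031)
```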
You are right that these coordinates are quite closely related. If the center is incorrect (for example, a bit too far to the right), that mistake can partially be "fixed" by predicting a greater width (the bounding box would extend a bit too far to the right, but still encapsulate the object). I don't know whether this is necessarily a problem, a fact that should be exploited in some way, or something that should be "put pressure on". If this is something you are concerned about, I suppose you could alternatively define the bounding box as follows (I'm not sure whether or not this is what's done in the repository linked above; a conversion sketch follows the list):
- X coordinate of top-left corner of bounding box
- Y coordinate of top-left corner of bounding box
- X coordinate of bottom-right corner of bounding box
- Y coordinate of bottom-right corner of bounding box
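If you wanted to experiment with this, one option is to keep your network's `(cx, cy, w, h)` output and only convert at loss time. A minimal sketch under the same assumptions as above (my own illustration; I don't know whether the linked repository does anything like this):

```python
import torch
import torch.nn.functional as F

def centers_to_corners(boxes):
    """Convert (cx, cy, w, h) boxes to (x1, y1, x2, y2) corner boxes."""
    cx, cy, w, h = boxes.unbind(dim=-1)
    return torch.stack([cx - w / 2, cy - h / 2,
                        cx + w / 2, cy + h / 2], dim=-1)

def corner_mse(pred, target):
    """MSE over the two corner points instead of center + size."""
    return F.mse_loss(centers_to_corners(pred), centers_to_corners(target))
```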
Intuitively, I suspect the relation between those two corner points will be weaker than the relation you identified between the center, width, and height. A "mistake" in the coordinates of the top-left corner cannot be partially "fixed" by placing the bottom-right corner somewhere else. The toy example below illustrates this.
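To make that concrete (these numbers are my own, purely illustrative): suppose the ground-truth box spans [0.4, 0.6] horizontally, and the prediction's center is 0.05 too far to the right while its width overshoots by 0.1, so the left edge still lands in the right place:

```python
# Toy example: the same pair of boxes in both parameterizations.
# Ground truth: center 0.5,  width 0.2 -> spans [0.4, 0.6]
# Prediction:   center 0.55, width 0.3 -> spans [0.4, 0.7]
target_cx, target_w = 0.50, 0.20
pred_cx, pred_w = 0.55, 0.30

# Center + size view: two coordinates are wrong, and the errors
# partially cancel in terms of where the box edges end up.
print(round(pred_cx - target_cx, 6))  # 0.05 center error
print(round(pred_w - target_w, 6))    # 0.1 width error

# Corner view: the same two boxes, expressed as edges.
target_x1, target_x2 = target_cx - target_w / 2, target_cx + target_w / 2
pred_x1, pred_x2 = pred_cx - pred_w / 2, pred_cx + pred_w / 2
print(round(pred_x1 - target_x1, 6))  # 0.0 -> left edge is exactly right
print(round(pred_x2 - target_x2, 6))  # 0.1 -> all error on the right edge
```

In the center + size parameterization, MSE penalizes two coordinates even though one edge is perfectly placed; in the corner parameterization, the penalty lands exactly on the one edge that is actually wrong.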