
Currently, I am looking at how Mask R-CNN works. It has a backbone, an RPN, several heads, etc. The backbone creates the feature maps, which are passed to the RPN to generate proposals. Each proposal is then aligned with the feature maps (RoIAlign) and rescaled to a fixed $n \times n$ size before entering the box head, mask head, or keypoint head.
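
For reference, here is a minimal sketch of the pooling step I am referring to, using torchvision's `roi_align`. The feature map size, the stride of 16, and $n = 7$ are just illustrative values, not taken from any particular model:

```python
import torch
from torchvision.ops import roi_align

# Backbone output: 1 image, 256 channels, 50x50 feature map
# (an assumed stride of 16 relative to the input image)
features = torch.randn(1, 256, 50, 50)

# Two RPN proposals in image coordinates, format (batch_idx, x1, y1, x2, y2)
proposals = torch.tensor([
    [0, 100.0, 120.0, 300.0, 400.0],
    [0, 400.0,  50.0, 600.0, 350.0],
])

# Every proposal, regardless of its original size, is pooled to the same
# n x n grid (7x7 here), which gives the heads a fixed-size input.
pooled = roi_align(features, proposals, output_size=(7, 7),
                   spatial_scale=1.0 / 16, sampling_ratio=2)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```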

Since conv2D is not scale-invariant, I think this rescaling to $n \times n$ is what introduces scale-invariant behaviour into the heads.

For an object that is occluded or truncated, I think scaling to $n \times n$ is not really appropriate.

Would it be possible to predict the visibility of the object inside the box head, outputting not only xyxy (the bounding-box output) but xyxy + x_size, y_size (the bounding-box output plus the width/height scale of the object)? These x_size and y_size values would then be used to rescale the $n \times n$ input.

So, if only half of the object is visible (occluded or truncated), the input to the keypoint head or mask head would be scaled to 0.5x by 0.5x.
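
To make the idea concrete, here is a rough sketch of what I have in mind. It assumes the box head additionally predicts per-RoI visible fractions `vis_x` and `vis_y` in (0, 1]; these outputs and the helper `rescale_roi_features` are hypothetical and not part of standard Mask R-CNN:

```python
import torch
import torch.nn.functional as F

def rescale_roi_features(pooled, vis_x, vis_y, n=14):
    """Shrink each RoI's n x n feature map by its predicted visible fraction,
    then zero-pad back to n x n so the head still receives a fixed-size input.
    (Hypothetical sketch of the idea, not an existing Mask R-CNN component.)"""
    out = torch.zeros_like(pooled)
    for i in range(pooled.shape[0]):
        h = max(1, int(round(n * vis_y[i].item())))
        w = max(1, int(round(n * vis_x[i].item())))
        shrunk = F.interpolate(pooled[i:i + 1], size=(h, w),
                               mode='bilinear', align_corners=False)
        out[i:i + 1, :, :h, :w] = shrunk
    return out

# Example: two RoIs, one fully visible, one roughly half occluded.
pooled = torch.randn(2, 256, 14, 14)
vis_x = torch.tensor([1.0, 0.5])
vis_y = torch.tensor([1.0, 0.5])
rescaled = rescale_roi_features(pooled, vis_x, vis_y)
print(rescaled.shape)  # torch.Size([2, 256, 14, 14])
```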

Is this a good approach to counter occlusion and truncation?

  • Can you please clarify this part "outputting not only xyxy, but also xyxy x_size y_size"? What are `xyxy` and `xyxy x_size y_size`? What does that mean? Note that I changed from x by x to $n \times n$ because of latex. – nbro Feb 07 '21 at 02:16
  • Edited for clarification: xyxy means xmin, ymin, xmax, ymax (I believe this is the bounding-box output), or xywh (xmin, ymin, width, height). – Darwin Harianto Feb 08 '21 at 00:39

0 Answers