Currently, I am looking at how Mask R-CNN works. It has a backbone, an RPN, several heads, etc. The backbone creates the feature maps, which are passed to the RPN to generate proposals. Those proposals are then aligned with the feature maps (RoIAlign) and rescaled to a fixed $n \times n$ pixels before entering the box head, mask head, or keypoint head.
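To make it concrete, here is a minimal sketch of the RoIAlign step I mean, using torchvision.ops.roi_align. The feature-map size, the stride of 16, and $n = 14$ are assumptions for illustration only, not values from a specific Mask R-CNN config:

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)  # one backbone/FPN level, assumed stride 16
# One proposal as (batch_index, x1, y1, x2, y2) in image coordinates.
proposals = torch.tensor([[0.0, 40.0, 60.0, 200.0, 360.0]])

n = 14  # fixed output resolution fed to the downstream head
pooled = roi_align(
    features,
    proposals,
    output_size=(n, n),
    spatial_scale=1.0 / 16,  # maps image coordinates onto this feature map
    sampling_ratio=2,
    aligned=True,
)
print(pooled.shape)  # torch.Size([1, 256, 14, 14]) -- every RoI becomes n x n
```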
Since Conv2D is not scale-invariant, I think it is this rescaling to $n \times n$ that introduces scale-invariant behaviour into the heads.
For an object that is occluded or truncated, however, I think rescaling to $n \times n$ is not really appropriate.
Would it be possible to predict the visibility of the object inside the box head, i.e., output not only xyxy (the bounding-box coordinates) but also xyxy + x_size, y_size (the bounding box plus the full width and height of the object)? These x_size and y_size values would then be used to rescale the $n \times n$ input.
So, if only half of the object is visible (occluded or truncated), the inputs to the keypoint head or mask head would be scaled to 0.5x by 0.5x.
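Here is a rough sketch of what I have in mind, not working training code. The module name VisibilityAwareRescale and the extra size_head are hypothetical, and for simplicity it predicts the visible fraction along each axis directly rather than absolute x_size / y_size (the fraction could equally be computed as box size divided by predicted full size):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisibilityAwareRescale(nn.Module):
    """Hypothetical add-on: predicts per-RoI visibility ratios and shrinks the
    n x n RoI features accordingly before the mask/keypoint head, padding back
    to n x n so downstream layers keep a fixed input size."""

    def __init__(self, in_channels: int, n: int = 14):
        super().__init__()
        self.n = n
        # Extra head: two sigmoids -> (visible_w / full_w, visible_h / full_h).
        self.size_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, 2),
            nn.Sigmoid(),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (num_rois, C, n, n) features from RoIAlign.
        ratios = self.size_head(pooled).clamp(min=0.25)  # avoid collapsing to zero size
        out = pooled.new_zeros(pooled.shape)
        for i, (rw, rh) in enumerate(ratios):
            h = max(1, int(round(self.n * rh.item())))
            w = max(1, int(round(self.n * rw.item())))
            # Squeeze the n x n features into an h x w region (e.g. 7 x 7 when
            # only half of the object is visible), then pad back to n x n.
            resized = F.interpolate(
                pooled[i : i + 1], size=(h, w),
                mode="bilinear", align_corners=False,
            )
            out[i, :, :h, :w] = resized[0]  # top-left placement is just one design choice
        return out

pooled = torch.randn(3, 256, 14, 14)
module = VisibilityAwareRescale(in_channels=256, n=14)
print(module(pooled).shape)  # torch.Size([3, 256, 14, 14])
```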
Is this a good approach to counter occlusion and truncation?