I've been trying to design an algorithm for aligning an object across two photos in realtime. I am able to localize the object (create an ROI/BBox) through an object detection (siamese) network for both photos, leaving me with a cropped image with the object in the center for both frames.
I am struggling to do the second step of my proposed algorithm, which is "finetuning" the alignment of one of the objects in the bounding box by calculating a homography with respect to the "template" crop. Ideally, I would like to be able to overlay the transformed second image on the first image and have the object roughly aligned in the image.
I have been trying to use cv2.findHomography with ORB keypoints and descriptors, but either the ORB extractor or the feature matcher ends up being inaccurate, which leads to terrible transformation estimates once the object is no longer stationary. ORB seems especially hard to tune: the parameter value that picks up enough keypoints is fastThreshold=5, which seems abnormally low to me. I use ORB features because I am aiming to use a "free" algorithm, unlike SURF or SIFT.
For context, here is an example of the cropped images I'm trying to align after the object is rotated w.r.t. the bounding box. The subject is a keycap I had lying around. The original image isn't that blurry -- I applied cv2.blur(img, (15,15)) before attempting to detect features and match them.
Aligning these images seems like it should be a simple task, but I'm increasingly worried that classical CV is insufficient for alignment. My background is not in classical CV, so I'm not sure if there is some trick/algorithm that can efficiently align these objects. Alternatively, suggestions for heuristics to determine robust parameters for these OpenCV methods, or for engineering my data to be more suitable, would be appreciated.
My background is biased towards neural networks, and I'm worried that I will need to turn to neural network-based approaches for image alignment or replace my object detector with a more end-to-end estimate of the orientation of the bounding box. Either way, the aim is to perform this alignment in realtime on consumer hardware.
Thanks!