I have some images with a fixed background and a single object which is placed, in each image, at a different position on that background. I want to find a way to extract, in an unsupervised way, the position of that object. For example, we, as humans, would record the x and y location of the object.

Of course the NN doesn't have a notion of x and y, but given an image, I would like the NN to produce 2 numbers that preserve as much as possible of the actual relative position of the object on the background. For example, if 3 objects are equally spaced on a straight line (in 3 of the images), I would like the 2 numbers produced by the NN for each of the 3 images to preserve this ordering, even if they don't form a straight line. They can form a weird curve, but as long as the ordering is correct, the curve can be topologically transformed into the right straight line.

Can someone suggest a paper/architecture that did something similar? Thank you!
- By fixed, you mean the background is always the same image? – Astariul Sep 17 '19 at 04:35
- @Astariul Yes! And the object that changes position is also the same in each image (same size, shape, orientation, etc.). – Silviu-Marian Udrescu Sep 17 '19 at 05:45
- @Silviu-MarianUdrescu You mightn't need machine learning for this; it sounds like the object is very well defined. If you can code something up that works 100% of the time, why not do that? – Recessive Sep 17 '19 at 06:09
- @Recessive Ideally I want a NN that is able to learn the 2-number representation of an image for any background and moving object. Coding it by hand works only for a fixed background and a fixed object (which is indeed what my post is about), but I want a NN approach so I can later generalize, i.e. if I pass a new set of images (all with the same background and object, but different from the ones in the previous set), the NN would identify the "x" and "y" just as easily, without any code modifications. Coding something up manually would require new code for each set of images. – Silviu-Marian Udrescu Sep 17 '19 at 06:15
- @Silviu-MarianUdrescu There should be a few ways of doing this. You could create a regression CNN that outputs (x,y) coordinates (I wouldn't recommend this; in my experience it's very hard to get working). You could use deconvolutional layers to produce an output image of similar dimensions to the input, using a softmax over the entire image to produce a probability of the object being at each location (you could also produce a downscaled version if an approximate location is OK). Other than that, there are some great resources online for object detection; just do a Google search. – Recessive Sep 17 '19 at 06:29
- I am not sure how I can use a regression CNN in an unsupervised way in this case. I am also not sure I understand what you mean by using deconvolutions and softmax; how can I use that to extract the x and y of the object? – Silviu-Marian Udrescu Sep 17 '19 at 06:33
- @Silviu-MarianUdrescu Ah sorry, I didn't see "unsupervised". OK, that changes things slightly. If you really want to use unsupervised learning, I'm sorry I can't help, as I have no experience with it. However, this problem sounds solvable with supervised learning. As I suggested earlier, you could hard-code a program to extract the (x,y) position of the object over a few hundred/thousand images, and use that to train a neural network. This should then improve robustness to future input changes. – Recessive Sep 17 '19 at 08:05
- @Silviu-MarianUdrescu As for deconvolutions, perhaps ignore that; you might be able to get away with the following: take a standard CNN and have the output be, say, 49 nodes for a 7x7 grid. This means that if node 4 has a value of 0.9, point (4,1) is 90% likely to contain the object. As for softmax, look it up online; it essentially converts the inputs to a probability distribution across all the output nodes. – Recessive Sep 17 '19 at 08:07
1 Answer
As mentioned in the comments, I wouldn't use machine learning for this.
You can achieve that result using something like OpenCV.
For example:
- Get the "naked" background image: if you don't have it, you can easily compute it by averaging all the images: `background = np.mean(images, axis=0)`
- For each image, calculate the pixel difference between the image and the background: `diffs = [img - background for img in images]`
- The diff's pixels can be negative, so take the absolute value of each pixel before converting the result to grayscale.
- If all goes well, you now have a dark, noisy image with a bright silhouette of your object.
- Set a threshold (e.g. `threshold = np.percentile(diff, 95)`) and make a binary mask, so that each pixel reads `1` for the object's silhouette and `0` for the background.
- Find the centroid of the object (e.g. by averaging the coordinates of all pixels equal to `1`). And there you have it!
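Putting the steps above together, here is a minimal NumPy sketch (the function and variable names are my own; it assumes `images` is a list of same-sized RGB frames already loaded as arrays):

```python
import numpy as np

def find_object_centroids(images):
    # Work in float to avoid uint8 wrap-around when subtracting.
    stack = np.stack(images).astype(np.float32)

    # Step 1: estimate the "naked" background as the per-pixel mean.
    background = stack.mean(axis=0)

    centroids = []
    for img in stack:
        # Steps 2-3: absolute pixel difference, collapsed to grayscale
        # by averaging the color channels.
        diff = np.abs(img - background)
        gray = diff.mean(axis=-1) if diff.ndim == 3 else diff

        # Step 4: binary mask from the 95th-percentile threshold.
        mask = gray > np.percentile(gray, 95)

        # Step 5: centroid = mean coordinates of the mask's "on" pixels.
        ys, xs = np.nonzero(mask)
        centroids.append((xs.mean(), ys.mean()))
    return centroids
```

Each returned pair is an (x, y) position in pixel coordinates, one per input image.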
Of course, I just described one clear and easy way to do it; you may find your own, better solution.
- ✅ No need to train a neural network
- ✅ No need to label data
- ✅ Works for any object/background set
- ✅ Precise coordinates
- ✅ Easy to build, debug, and adapt
- ✅ Runs fast

Andre Goulart