
With an RGB image of a paper sheet with text, I want to obtain an output image which is cropped and deskewed. Example of input:

(example input photo)

I have tried non-AI tools (such as OpenCV's findContours) to find the 4 corners of the sheet, but it's not very robust in some lighting conditions, or when there are other elements in the photo.

So I see two options:

  • a NN with input=image, output=image that does everything (including the deskewing, and even the brightness adjustment). I'd just train it on thousands of images.

  • a NN with input=image, output=coordinates_of_4_corners. Then I'd do the cropping + deskewing with a homography, and the brightness adjustment with standard non-AI tools.
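The warp in option 2 is standard: with OpenCV it's just cv2.getPerspectiveTransform plus cv2.warpPerspective. For illustration, the underlying homography solve can be sketched in plain NumPy (the corner values below are made up):

```python
import numpy as np

def homography_from_points(src, dst):
    """Direct Linear Transform: solve for the 3x3 homography H mapping
    4 src points to 4 dst points (what cv2.getPerspectiveTransform computes)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the null vector of A (smallest singular vector).
    _, _, Vt = np.linalg.svd(np.asarray(A, float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def warp_point(H, p):
    """Apply the homography to one point (homogeneous divide)."""
    x, y, w = H @ np.array([p[0], p[1], 1.0])
    return x / w, y / w

# Map the (skewed) detected corners to an upright A4-proportioned rectangle.
corners = [(12, 8), (290, 20), (300, 410), (5, 400)]   # illustrative detector output
target  = [(0, 0), (595, 0), (595, 842), (0, 842)]     # A4 at 72 dpi
H = homography_from_points(corners, target)
```

With 4 exact point pairs the solve is exact, so each detected corner lands exactly on its target corner after warping.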

Which approach would you use?

More generally what kind of architecture of neural network would you use in the general case input=image, output=image?

Is approach #2, for which input=image, output=coordinates, even possible? Or is there another segmentation method you would use here?

logijaz
  • have you tried preprocessing the image before applying findContours? Like contrast boosting, conversion to HSV space, masking with edge detection. A neural net for such a task sounds like real overkill to me, and complicated because of the annotations it would require. – Edoardo Guerriero Mar 01 '22 at 19:12
  • I agree that a neural net seems like overkill. That being said, this paper seems interesting: "Convolutional Neural Network Architecture for Geometric Matching" Ignacio Rocco, Relja Arandjelovic, Josef Sivic https://openaccess.thecvf.com/content_cvpr_2017/html/Rocco_Convolutional_Neural_Network_CVPR_2017_paper.html – The Guy with The Hat Mar 02 '22 at 02:32

2 Answers


I think the second approach is best, because it only requires your training set to be annotated with the coordinates of the four corners of the paper sheet.

This is sort of the idea behind a Region Proposal Network, which is used in Faster R-CNN (section 3.1).

Here is a reference implementation of a Region Proposal Network in PyTorch from the torchvision library. Notice how the network outputs boxes (in the forward() method), each of which is a tuple (x1, y1, x2, y2). From these four coordinates, you could crop the image to the desired paper sheet region.
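For this specific task you don't even need the full RPN machinery: a plain regression head on a CNN backbone can output the four corners directly. A minimal sketch in PyTorch (layer sizes are illustrative, and the Sigmoid assumes corners normalized to [0, 1]; a pretrained torchvision backbone would work better in practice):

```python
import torch
import torch.nn as nn

class CornerRegressor(nn.Module):
    """Tiny CNN that regresses 4 (x, y) corner coordinates,
    normalized to [0, 1] relative to image width/height."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global average pool
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32, 8),                 # 4 corners x (x, y)
            nn.Sigmoid(),                     # keep coordinates in [0, 1]
        )

    def forward(self, x):
        return self.head(self.features(x)).view(-1, 4, 2)

model = CornerRegressor()
out = model(torch.randn(2, 3, 256, 256))      # batch of 2 images
```

Train it with an L1 or L2 loss against the annotated corners; at inference, rescale the [0, 1] outputs by the image size and hand them to the homography step.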

Lars

You could try a U-Net for approach 1.

This is called the image-to-image translation problem in machine learning. You can find more architectures in this paper: https://arxiv.org/pdf/2101.08629.pdf
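For a concrete picture, here is a toy two-level U-Net in PyTorch (a sketch with illustrative channel sizes, not the configuration from the original paper):

```python
import torch
import torch.nn as nn

def block(cin, cout):
    """Two 3x3 convolutions with ReLU, the basic U-Net building block."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    """Minimal two-level U-Net: encoder, bottleneck, decoder with
    skip connections concatenating encoder features at each scale."""
    def __init__(self):
        super().__init__()
        self.enc1 = block(3, 16)
        self.enc2 = block(16, 32)
        self.bott = block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec2 = block(64, 32)             # 64 = 32 upsampled + 32 skip
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = block(32, 16)             # 32 = 16 upsampled + 16 skip
        self.out = nn.Conv2d(16, 3, 1)        # back to a 3-channel image

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bott(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out(d1)

y = TinyUNet()(torch.randn(1, 3, 64, 64))     # output has the input's shape
```

Note the skip connections (the torch.cat calls) tie input and output pixels at the same spatial location, which is exactly the limitation discussed in the comments below when the target is a spatially warped version of the input.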

  • A U-net wouldn't help a whole lot if the image is being spatially transformed, since the skip connections wouldn't line up correctly between the input and output. That said, I guess a U-net is probably better than an autoencoder-like model without _any_ skip connections, so idk. – The Guy with The Hat Mar 02 '22 at 02:24
  • Thanks for your answer. – logijaz Mar 03 '22 at 08:55
  • @TheGuywithTheHat What kind of neural network would you use as a general tool for the general problem of image=in, image=out, where you want the NN to learn transformations of images from thousands of examples? I would be interested in an answer about this, if you have one! Thanks – logijaz Mar 03 '22 at 08:56
  • @logijaz Accurate spatial transformations are something that current neural net architectures are just fundamentally unsuited to solve. U-nets connect pixels that are in the same location in the input and output, but with spatial transformations those pixels don't correspond to one another, so it's pointless. Other architectures can try to transform the image, but will lose a significant amount of detail (e.g. any text will become gibberish or disappear entirely). If you _really_ want to involve a neural net, use it to find the corners, and then use traditional methods to transform the image. – The Guy with The Hat Mar 03 '22 at 22:20