Is it possible to detect text, symbols, and components directly in a scanned PDF file with a program like TensorFlow or another tool?

Question

I need to extract information from PDF documents produced by a scanner. The program needs to be trainable in some way so that it can learn what the different figures mean. Most of this should happen without human intervention, so that it simply returns a result after the file is scanned. Does anyone know whether this is possible with a machine learning program, or in some alternative way?

score 1 · Answer 1 · answered Jan 16 '19 at 10:36


Yes, that's possible. I am working on a project in which I have to detect text in images. A quick search turned up these two algorithms:

1. EAST (Efficient and Accurate Scene Text Detector)
This one is based on machine learning too: it is a deep-learning model built on a fully convolutional network. Here are some links (link1, link2) that explain how to use it with an example, and how to use Tesseract to extract the detected text.

2. CTPN (Connectionist Text Proposal Network)
This algorithm is based on machine learning. Here is its link on GitHub. In the description you will find a link to a pre-trained model that you can use. Alternatively, you can prepare your own data and train your own model.

I tried both of them, and the CTPN model gave better results for me, especially when the image contains large text.
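To show how the pieces could fit together for a scanned PDF, here is a minimal sketch, not a tested pipeline. It assumes the `pdf2image` package (which needs poppler) to turn pages into images and `pytesseract` (which needs the Tesseract binary) to read the text. The `detect` argument is a hypothetical function that wraps whichever detector you choose (EAST, CTPN, ...) and returns pixel boxes; the names `pad_and_clamp` and `ocr_scanned_pdf` are made up for this example.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def pad_and_clamp(box: Box, width: int, height: int, pad: int = 4) -> Box:
    """Grow a box by `pad` pixels on each side without leaving the page."""
    left, top, right, bottom = box
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )


def ocr_scanned_pdf(pdf_path: str, detect: Callable[..., List[Box]]) -> List[List[str]]:
    """Return, for each page, the text recognized inside each detected box."""
    # Third-party imports are local so pad_and_clamp works without them.
    from pdf2image import convert_from_path  # requires poppler
    import pytesseract  # requires the tesseract binary

    results = []
    for page in convert_from_path(pdf_path, dpi=300):  # one PIL image per page
        width, height = page.size
        texts = []
        for box in detect(page):  # hypothetical detector wrapper (assumption)
            crop = page.crop(pad_and_clamp(box, width, height))
            # --psm 7 asks Tesseract to treat the crop as a single text line
            text = pytesseract.image_to_string(crop, config="--psm 7")
            texts.append(text.strip())
        results.append(texts)
    return results
```

The small padding is there because detectors tend to cut boxes tightly around the glyphs, and Tesseract usually reads a crop better when it has a little margin.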


singrium


Thank you very much for the answer, but I looked at the website a bit and can see that these programs have a hard time finding the symbols that I also need to detect. The text I have to find can be small and is sometimes sideways. Example of the text: https://ibb.co/jfjhW8J – Nicolaj Juul Jan 17 '19 at 09:09
I applied the CTPN algorithm to your image, and this is [the result](https://ibb.co/CQbVvGC). It is not very precise, since it didn't detect all the text in the image. Maybe if you train it with your own data (which would take a lot of time), the results would be more precise. If you find a better method, please post it here so the whole community can benefit from it. Thank you! – singrium Jan 17 '19 at 09:24
I will probably need to train my own model to do this for me. Do you still think CTPN would be the best platform for a project like this? And do you know of a machine learning program where you can evaluate the results afterward to tell it what it did right and wrong? – Nicolaj Juul Jan 17 '19 at 09:40
In my situation CTPN was more precise, but in general I am not sure. It strongly depends on the characteristics of the text you want to detect. To my knowledge, there is no such thing as a 'machine learning program to evaluate the results'. You can evaluate your model by checking the accuracy, validation accuracy, and loss during training. After training, you can test the model by giving it some images and checking whether it detects all the text zones. – singrium Jan 17 '19 at 09:52
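The "see if it detects all the text zones" check from the last comment can be turned into numbers. Below is a rough sketch, assuming you have hand-labelled boxes for a few test images and that a detection counts as correct when its intersection over union (IoU) with a labelled box is at least 0.5. That threshold and the greedy matching are choices made for this example, not something CTPN or EAST prescribes.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def precision_recall(
    detected: List[Box], truth: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each detection to at most one unmatched labelled box."""
    unmatched = list(truth)
    hits = 0
    for det in detected:
        best = max(unmatched, key=lambda gt: iou(det, gt), default=None)
        if best is not None and iou(det, best) >= threshold:
            unmatched.remove(best)  # each labelled box can be matched once
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall
```

Low recall means text zones are being missed (as in the result image above), while low precision means the detector is marking regions that contain no text. Tracking both on a fixed set of test images gives a way to compare models or retraining runs.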
