
I am looking for software that recognizes text within images. I tried all of the tools mentioned here (gocr, fuzzyocr, libhocr0, ocrad, ocrfeeder, ocropus, tesseract-ocr, cuneiform). My input was a photograph of a printed document, so printed letters rather than handwriting. Of all the tools, tesseract-ocr was the most accurate in my tests, but it still produces many errors. Hence, scanning a document to an image file and then indexing it or running NLP on the result sadly isn't an option: the error rate is too high.

So, given the age of the above-mentioned post, are there any better tools for extracting text from images or photographs?

EDIT 1:

By "image containing text" I mean that I have a PNG/JPG/BMP file as the source, and I want to extract the pixelized text within it and get ASCII/UTF-8 text as the output.
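
To make the goal concrete, here is a minimal Python sketch of the image-to-text step, assuming the `tesseract` command-line tool is installed and on the `PATH`. The function names (`build_tesseract_command`, `image_to_text`, `index_images`) are hypothetical helpers made up for illustration; the only Tesseract features relied on are the `stdout` output base and the `-l` language option.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="eng"):
    """Return the argument list for a Tesseract CLI call that prints to stdout."""
    # Passing "stdout" as the output base makes tesseract write the
    # recognized text to standard output instead of a .txt file.
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def image_to_text(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text as a str."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        check=True,            # raise CalledProcessError if tesseract fails
        capture_output=True,
    )
    # Decode as UTF-8; replace undecodable bytes rather than failing.
    return result.stdout.decode("utf-8", errors="replace")


def index_images(directory, lang="eng",
                 extensions=(".png", ".jpg", ".jpeg", ".bmp")):
    """Map each image file name in `directory` to its recognized text."""
    index = {}
    for path in sorted(Path(directory).iterdir()):
        if path.suffix.lower() in extensions:
            index[path.name] = image_to_text(path, lang)
    return index
```

Calling `index_images("scans")` on a directory of images (the directory name is only an example) would return a dictionary that a plain text search could run over. The quality of that text still depends entirely on the OCR engine and the input image.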

Zanna
  • 70,465
Socrates
  • 2,473
  • I expect the issue you are experiencing is due to a poor photo. Try using an app like Office Lens or PhotoScan by Google to get a straightened image. – Tim May 18 '17 at 16:19
  • @Tim The idea is to index existing images containing text, so that they can be found via text search. The Google product works nicely, but due to security and privacy issues it is not an option. The same goes for Microsoft's Office Lens. And I am specifically looking for a solution on Linux / Ubuntu. – Socrates May 18 '17 at 16:29
  • Okay. My next suggestion was to upload them to OneDrive, as it does automatic OCR. Sorry I can't help more. – Tim May 18 '17 at 16:30
  • When you say "image containing text", do you mean ASCII text or pixelized text? I presume pixelized; otherwise the strings command could help. –  May 18 '17 at 16:34
  • What is the language of the texts? (Consider installing tesseract-langpack-....) What is the image resolution? See also tesseract4 and the commercial ABBYY FineReader. –  May 18 '17 at 16:43
  • @WillemK Posted Edit 1. – Socrates May 18 '17 at 17:33
  • @JJoao The language in my tests is English. The resolution should be OK: one big letter 'S' would be 15x21 pixels, one small 'c' would be 12x16. – Socrates May 18 '17 at 17:45
  • @JJoao OK, I installed the package tesseract-ocr-all and it now works much better with exactly the same input file I used for testing before. My result is a plain text file. Is there a way to also recognize italic and bold text? Apparently this is possible programmatically. – Socrates May 18 '17 at 17:59
  • @JJoao Is there any way to recognize handwritten text? I've tried with tesseract, but the results are only gibberish. – Socrates May 20 '17 at 01:19
  • Handwritten text can be done, but you need to train tesseract. Tesseract uses machine learning algorithms (a neural network) that are pre-trained per language, but you can train them to recognize pretty much anything, even handwritten text, as explained here. – dadexix86 Aug 05 '18 at 18:01
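
Since installing additional language data made a noticeable difference in the comments above, it can help to check which languages Tesseract actually sees before running it. The following is a small sketch, assuming the `tesseract` binary is on the `PATH`. `parse_language_list` and `installed_languages` are hypothetical helper names, and the parsing assumes the usual `--list-langs` layout of one header line ending in a colon followed by one language code per line.

```python
import subprocess


def parse_language_list(output):
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon.
        if not line or line.endswith(":"):
            continue
        langs.append(line)
    return langs


def installed_languages():
    """Ask the local tesseract binary which language data it can find."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: some Tesseract versions print the list on stderr,
    # so both streams are read. Unrelated warnings on stderr would
    # need extra filtering.
    return parse_language_list(result.stdout + "\n" + result.stderr)
```

If `"eng"` is missing from the list returned by `installed_languages()`, the corresponding language package is probably not installed, which would match the poor results seen before `tesseract-ocr-all` was added.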

0 Answers