
I tried to add a text layer to some PDF files to make them searchable. The technique is explained in the German Ubuntu wiki: http://wiki.ubuntuusers.de/pdfsandwich . After installing the dependencies

sudo apt-get install imagemagick exactimage ghostscript tesseract-ocr

and pdfsandwich itself it should be as simple as

pdfsandwich test.pdf

However I get:

Input file: "test.pdf"
Output file: "test_ocr.pdf"
Number of pages in inputfile: 272

Parallel processing with 8 threads started.
Processing page order may differ from original page order.

Processing page 137.
Processing page 171.
Processing page 1.
PProcessing page Processing pProcessing page rocess35.
age 239.
Processing page 69.
205.
ing page 103.
sh: 1: cannot open /tmp/pdfsandwich4e375e.html: No such file

followed by many more cannot open ... warnings. Inspecting my /tmp directory shows that the corresponding *.txt files exist instead of these *.html files, so tesseract apparently does not output in hocr format. I read the tesseract man page and tried to enforce hocr output by creating a config file named tesseract-config containing

hocr true

(I tried various variations thereof) and starting pdfsandwich with

pdfsandwich -tesso tesseract-config test.pdf

But this does not seem to change anything. Any ideas how I can make pdfsandwich produce proper output?

Note the related questions "How to add OCRed text to original pdf in gscan2pdf?" and "Adding OCR info to a PDF". However, I need to process many PDF files, so I need a command-line solution that I can automate.

  • I have not encountered this problem yet. Could you please specify: (1) Which versions of Ubuntu and tesseract did you use when the problem occurred? Was it the same version you use now (3.02.01) or an older one? (2) Did you use the Ubuntu deb package of pdfsandwich? This package should resolve all dependencies and ensure you get the correct versions of the packages pdfsandwich needs. See the "Download and Installation" section of the manual: http://www.tobias-elze.de/pdfsandwich/ . If this is a general problem with some Ubuntu version, I will address it in future versions of pdfsandwich. Tobia –  Aug 07 '13 at 08:47
  • Yes, it was the same version. I used the Kubuntu 13.04 packages, 64 bit. I solved the problem as described in the answer. A remaining problem was that, with every PDF I tried, the text layer did not match the image on the last few pages. It would be great if that were fixed. – highsciguy Aug 12 '13 at 18:28

2 Answers

1

It turned out that the format of the config file changed with the tesseract version currently shipped in Ubuntu (3.02.01): http://code.google.com/p/tesseract-ocr/source/browse/trunk/tessdata/configs/hocr?r=526 . Tesseract can now be instructed to output in hocr format with a single-line configuration file named tesseract-config:

tessedit_create_hocr 1

As noted in the question, tesseract can be instructed to read the config file by passing the -tesso option to pdfsandwich:

pdfsandwich -tesso tesseract-config test.pdf
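
Since the question asks for something that can be automated over many files, here is a minimal sketch of how the two steps above might be combined in a loop. The PDF_DIR and CONFIG variables are illustrative names, not part of pdfsandwich, and the check for the _ocr.pdf suffix assumes the output naming shown in the question's log (test.pdf becomes test_ocr.pdf).

```shell
#!/bin/sh
# Sketch only: write the one-line tesseract config and run pdfsandwich on
# every PDF in a directory. PDF_DIR and CONFIG are illustrative names.
PDF_DIR="${PDF_DIR:-.}"
CONFIG="${CONFIG:-tesseract-config}"

# Single-line config file that asks tesseract 3.02 for hocr output.
printf 'tessedit_create_hocr 1\n' > "$CONFIG"

for pdf in "$PDF_DIR"/*.pdf; do
    # Skip the unexpanded pattern when the directory holds no PDFs.
    [ -e "$pdf" ] || continue
    # Skip files that look like earlier pdfsandwich output (name_ocr.pdf).
    case "$pdf" in
        *_ocr.pdf) continue ;;
    esac
    # Same invocation as above; report failures but keep going.
    pdfsandwich -tesso "$CONFIG" "$pdf" || echo "failed: $pdf" >&2
done
```

Run it from the directory that holds the PDFs, or set PDF_DIR to point elsewhere. As the other answer explains, this workaround is only relevant for older pdfsandwich versions.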
1

The reason for this error is that tesseract changed its default file extensions for hocr, making it incompatible with pdfsandwich <0.1.0. For tesseract 3.02 with pdfsandwich <0.1.0, it helps to modify the tesseract option file and to pass it to pdfsandwich with -tesso.

Tesseract 3.03, which is the default tesseract version in Ubuntu 14.04, substantially changed its hocr handling, making it partially incompatible with hocr2pdf. As a result, the "-tesso" fix will often produce text layers that do not align with the scanned images. With this version, each PDF page needs to be produced by tesseract itself rather than by hocr2pdf.

Pdfsandwich >=0.1.0 automatically recognizes the tesseract version and chooses the appropriate way of interaction with tesseract, so that all these errors do not occur anymore.
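
For anyone scripting around older installations, the cases described above can be summarized in a small helper. This is only a sketch of the decision table in this answer, not how pdfsandwich itself detects versions; the function name choose_workaround and the labels it prints are made up for illustration, and the version strings are supplied by the caller rather than queried from the tools.

```shell
#!/bin/sh
# Sketch only: map the version combinations discussed above to a label.
# Prints one of: none, tesso, upgrade, unknown (labels are illustrative).
choose_workaround() {
    tess="$1"       # tesseract version string, e.g. 3.02.01
    sandwich="$2"   # pdfsandwich version string, e.g. 0.0.9

    # Extract major and minor from the pdfsandwich version.
    s_major=${sandwich%%.*}
    s_rest=${sandwich#*.}
    s_minor=${s_rest%%.*}

    # pdfsandwich >= 0.1.0 adapts to the tesseract version by itself.
    if [ "$s_major" -gt 0 ] || [ "$s_minor" -ge 1 ]; then
        echo none
        return
    fi

    case "$tess" in
        3.02*) echo tesso ;;    # the -tesso config fix helps here
        3.03*) echo upgrade ;;  # -tesso tends to misalign the text layer
        *)     echo unknown ;;  # combination not covered in this answer
    esac
}
```

For example, choose_workaround 3.02.01 0.0.9 prints tesso, while any tesseract version combined with pdfsandwich 0.1.0 prints none.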

Tobias Elze