I am looking for a way to convert images of a PDF to a real text file.

I have tried Okular, gscan2pdf, GIMP, ImageMagick and Xpdf, but I am having problems with all of them. This may be because I don't have much experience with them, and I'm struggling to understand the instructions I've found. I would appreciate a beginner-level explanation.

Zanna
  • 70,465
  • Can you clarify what you mean by "images of a PDF"? Have you seen this question? http://askubuntu.com/questions/59389/how-can-i-extract-text-from-images – Seth Mar 25 '15 at 00:17

1 Answer

First, install poppler-utils, which contains pdfimages. pdfimages is a command-line tool that extracts the images from a PDF file and can save them as JPEG files.

Open a terminal by pressing Ctrl+Alt+T.

Install the software:

sudo apt-get update
sudo apt-get install poppler-utils

The syntax of this tool is:

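pdfimages -j file.pdf image_root

Where file.pdf is the file you want to extract images from and image_root is a filename prefix for the extracted images. It is not a directory: pdfimages does not create one, so to collect the images in a directory, create it first and include it in the prefix (for example images/page).

Images are saved in the following format:

image_root-nnn.jpg

That is, each file is named with the prefix, a consecutive number and an extension. With -j, only images stored in the PDF as JPEG (DCT) data get the .jpg extension; other images are written as .ppm or .pbm files.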
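The extraction step can be wrapped in a small shell function. This is only a sketch: the function name extract_images and the prefix "page" are made up for illustration. It relies on the pdfimages man page, which describes the second argument as an image root (a filename prefix), so putting a directory in front of the prefix places the files in that directory.

```shell
#!/bin/sh
# Sketch: extract the images of a PDF into a dedicated directory.
# "extract_images" and the "page" prefix are hypothetical names for this example.
extract_images() {
    pdf=$1
    outdir=$2
    # pdfimages does not create directories, so make sure the target exists.
    mkdir -p "$outdir" || return 1
    # The second argument is a filename prefix; files end up as
    # $outdir/page-000.jpg, $outdir/page-001.jpg, ... (or .ppm/.pbm).
    pdfimages -j "$pdf" "$outdir/page"
}

# Example call (file.pdf and images are placeholders):
# extract_images file.pdf images
```

After a call such as extract_images file.pdf images, listing the images directory should show one file per image found in the PDF.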

Second, install an OCR application, for example OCRFeeder, together with some OCR engines:

sudo apt-get update
sudo apt-get install tesseract-ocr ocrfeeder tesseract-ocr-eng gocr cuneiform ocropus ocrad

Start OCRFeeder, then choose the OCR engine you want to use: open the Edit menu and select Preferences.

In the dialog that opens, select the Tools tab. Here you will see an option for the favorite engine. Set it to Tesseract and then press the OK button.

With the settings in place, you can start the actual work.

To do this, press the + symbol, then select the image file you want to open.

If the image needs retouching, open the Tools menu and select the Unpaper option. The dialog that opens offers various options and filters for cleaning up the image.
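If you prefer to stay in the terminal, the tesseract command installed above can also be run directly on the extracted images, without the OCRFeeder interface. The following is a rough sketch rather than part of the steps above: the function name ocr_images is invented for this example, and it assumes the images sit in one directory with .jpg, .ppm or .pbm extensions.

```shell
#!/bin/sh
# Sketch: run tesseract on every extracted image in a directory.
# "ocr_images" is a hypothetical helper name; "eng" assumes tesseract-ocr-eng is installed.
ocr_images() {
    dir=$1
    lang=${2:-eng}
    for img in "$dir"/*.jpg "$dir"/*.ppm "$dir"/*.pbm; do
        # Skip patterns that matched no file.
        [ -e "$img" ] || continue
        # tesseract IMAGE OUTPUTBASE: the .txt extension is appended automatically,
        # so page-000.jpg produces page-000.txt next to it.
        tesseract "$img" "${img%.*}" -l "$lang"
    done
}

# Example call (images is a placeholder directory):
# ocr_images images
# cat images/*.txt > result.txt
```

Each image produces its own text file, and the last commented line shows one way to join them into a single file. Recognition quality depends heavily on the scan, so expect to proofread the result.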

Zanna
  • 70,465
kyodake
  • 15,401