Regular PDF files consist of vector elements, such as text and vector graphics, and other embedded data, such as image files. Extracting the latter is quite easy with utilities such as pdfimages (as described in this Q&A).

On the other hand, scanned PDF documents are compilations of scanned pages. Every single page is a bitmap image, possibly overlaid with a searchable text layer produced by OCR. As a result, running pdfimages on a scanned PDF document will merely extract the scanned pages.

What I am looking for is an application or command-line utility that can distinguish between images and text in a scanned PDF document and extract the former.

Does anything like this exist?

  • To see what a scanned PDF looks like, view this PDF: http://www.2shared.com/document/NCf1JOei/Modern_Control_Engineering__4t.html – user3446207 Sep 11 '14 at 10:39
  • The above is a PDF, but its text cannot be read: searching for any text returns nothing. – user3446207 Sep 11 '14 at 10:40
  • Unfortunately I wasn't able to find a proper solution for PDF files, only this Python script that can process single images. If you don't manage to get an answer here, you might want to try asking at diybookscanner, the largest forum dedicated to document scanning and archiving on the web. – Glutanimate Sep 11 '14 at 14:25

1 Answer

Use pdfimages, a PDF image extraction tool.

Usage: pdfimages [options] <PDF-file> <image-root>

Example: Save images in JPEG format

pdfimages -j in.pdf /tmp/out
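If you want to call this from a script, here is a minimal Python sketch that wraps the same command. The function names pdfimages_command and extract_images are made up for illustration; only the pdfimages invocation itself comes from the usage line above. Note that -j saves only DCT-encoded images as JPEG files; other images are still written in PBM/PPM format.

```python
import shutil
import subprocess


def pdfimages_command(pdf_path, image_root, jpeg=True):
    """Build the argument list for: pdfimages [-j] <PDF-file> <image-root>.

    Hypothetical helper, not part of pdfimages or Poppler.
    """
    cmd = ["pdfimages"]
    if jpeg:
        cmd.append("-j")  # save DCT-encoded images as JPEG files
    cmd += [pdf_path, image_root]
    return cmd


def extract_images(pdf_path, image_root, jpeg=True):
    """Run pdfimages; raises if the tool is missing or exits with an error."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH")
    subprocess.run(pdfimages_command(pdf_path, image_root, jpeg), check=True)


# Example call, equivalent to: pdfimages -j in.pdf /tmp/out
# extract_images("in.pdf", "/tmp/out")
```

On a scanned PDF this will still write out one image per page, since each page is itself the embedded image.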

PS: someone, please mark this as a duplicate of: Extracting embedded images from a PDF [credit goes to pl1nk: https://askubuntu.com/users/48864/pl1nk ]

  • I know about that tool, but it is for regular PDFs. When I try it on scanned PDF files, it extracts every whole page instead of the images within the pages. – user3446207 Sep 11 '14 at 10:17
  • @user3446207 That's because in a scanned PDF file each page is an image. Your scanner takes pictures of your document and the scanning software embeds these images in a PDF file. I don't know of any automated solution that would be able to distinguish between text areas and image areas in a scanned page. You could use ScanTailor to mark text and images, but that's not automated. – Glutanimate Sep 11 '14 at 12:13