Regular PDF files consist of vector elements, such as text and vector graphics, and other embedded data, such as image files. Extracting the latter is quite easy with utilities such as pdfimages (as described in this Q&A).
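For reference, here is a minimal sketch of that kind of extraction. It assumes pdfimages from poppler-utils is installed and on the PATH; the helper names (build_pdfimages_command, run_pdfimages) and the file names are made up for illustration.

```python
import shutil
import subprocess


def build_pdfimages_command(pdf_path, output_prefix=None):
    """Return the argument list for a pdfimages call.

    Without output_prefix, only list the embedded images (-list);
    with it, write each embedded image out as a PNG file (-png).
    """
    if output_prefix is None:
        return ["pdfimages", "-list", str(pdf_path)]
    return ["pdfimages", "-png", str(pdf_path), str(output_prefix)]


def run_pdfimages(pdf_path, output_prefix=None):
    """Run pdfimages and return its standard output as text."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages (poppler-utils) was not found on PATH")
    cmd = build_pdfimages_command(pdf_path, output_prefix)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Example usage (placeholder file names):
#   print(run_pdfimages("document.pdf"))   # table of embedded images
#   run_pdfimages("document.pdf", "img")   # writes img-000.png, img-001.png, ...
```

On a regular PDF the listing has one row per embedded image. On a scanned PDF it typically shows one large image per page, which is exactly the limitation described next.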
On the other hand, scanned PDF documents are compilations of scanned pages. Every single page is a bitmap image, possibly overlaid with a searchable text layer produced by OCR. As a result, running pdfimages on a scanned PDF document will merely extract the scanned pages.
What I am looking for is an application or command-line utility that can distinguish between images and text in a scanned PDF document and extract the former.
Does anything like this exist?