71

I'm looking for a good program to show me the differences between two similar PDF files. In particular, I'm looking for something that doesn't just run diff on an ASCII version of the files (produced with "pdftotext"), which is all that pdfdiff.py does.

Braiam

7 Answers

46

You can use DiffPDF for this. From the description:

DiffPDF is used to compare two PDF files. By default the comparison is of the text on each pair of pages, but comparing the appearance of pages is also supported (for example, if a diagram is changed or a paragraph reformatted). It is also possible to compare particular pages or page ranges. For example, if there are two versions of a PDF file, one with pages 1-12 and the other with pages 1-13 because of an extra page having been added as page 4, they can be compared by specifying two page ranges, 1-12 for the first and 1-3, 5-13 for the second. This will make DiffPDF compare pages in the pairs (1, 1), (2, 2), (3, 3), (4, 5), (5, 6), and so on, to (12, 13).
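To make the pairing rule in that description concrete, here is a minimal shell sketch that expands two page-range specifications and pairs them position by position. The function names `expand_ranges` and `pair_pages` are made up for illustration and are not part of DiffPDF; the sketch only mimics the pairing described above and assumes well-formed ranges.

```shell
# Hypothetical helper: expand a spec such as "1-3, 5-13" into one page per line.
expand_ranges() {
    local part
    # Drop spaces, then treat each comma-separated part as "first-last" or "page".
    for part in $(printf '%s' "$1" | tr -d ' ' | tr ',' ' '); do
        seq "${part%-*}" "${part#*-}"
    done
}

# Hypothetical helper: pair the n-th page of the first spec
# with the n-th page of the second spec, printed as "left,right".
pair_pages() {
    { expand_ranges "$1"; echo "--"; expand_ranges "$2"; } |
    awk '$0 == "--" { second = 1; next }
         !second    { left[++n] = $0; next }
                    { print left[++m] "," $0 }'
}
```

Calling `pair_pages "1-12" "1-3, 5-13"` prints twelve lines, running from `1,1` through `4,5` to `12,13`.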

qbi
  • 5
    This is the best I've seen. The only issue I see is that it compares the PDFs page-for-page. So if you add a paragraph on, say, page 1, the beginning and end of every page after that does not match. :( – krumpelstiltskin May 08 '11 at 00:21
  • 10
    I think the link is no longer correct. The new version 3.* seems to be available only for Windows. The old version 2.* can still be installed via sudo apt-get install diffpdf, though. – Peter Zeller Apr 09 '14 at 15:48
39

I just figured out a hack to make DiffPDF (the program suggested by @qbi) usable for more than minor changes. What I do is concatenate all the pages of each PDF into one long scroll using pdfjam and then compare the scrolls. It works even when large sections are removed or inserted!

Here is a bash script that does the job:

#!/bin/bash
#
# Compare two PDF files.
# Dependencies:
#  - pdfinfo (xpdf)
#  - pdfjam  (texlive-extra-utils)
#  - diffpdf
#

MAX_HEIGHT=15840  #The maximum height of a page (in points), limited by pdfjam.

TMPFILE1=$(mktemp /tmp/XXXXXX.pdf)
TMPFILE2=$(mktemp /tmp/XXXXXX.pdf)

usage="usage: scrolldiff -h FILE1.pdf FILE2.pdf
  -h print this message

v0.0"

while getopts "h" OPTIONS ; do
    case ${OPTIONS} in
        h|-help) echo "${usage}"; exit;;
    esac
done
shift $(($OPTIND - 1))

if [ -z "$1" ] || [ -z "$2" ] || [ ! -f "$1" ] || [ ! -f "$2" ]
then
  echo "ERROR: input files do not exist."
  echo
  echo "$usage"
  exit 1
fi

#Get the number of pages:
pages1=$( pdfinfo "$1" | awk '/^Pages:/ {print $2}' )
pages2=$( pdfinfo "$2" | awk '/^Pages:/ {print $2}' )
numpages=$pages2
#Compare as integers; [[ a > b ]] would compare the values as strings.
if (( pages1 > pages2 ))
then
  numpages=$pages1
fi

#Get the paper size:
width1=$( pdfinfo "$1" | grep 'Page size' | awk '{print $3}' )
height1=$( pdfinfo "$1" | grep 'Page size' | awk '{print $5}' )
width2=$( pdfinfo "$2" | grep 'Page size' | awk '{print $3}' )
height2=$( pdfinfo "$2" | grep 'Page size' | awk '{print $5}' )

if [ $(bc <<< "$width1 < $width2") -eq 1 ]
then
  width1=$width2
fi
if [ $(bc <<< "$height1 < $height2") -eq 1 ]
then
  height1=$height2
fi

height=$( echo "scale=2; $height1 * $numpages" | bc )
if [ $(bc <<< "$MAX_HEIGHT < $height") -eq 1 ]
then
  height=$MAX_HEIGHT
fi
papersize="${width1}pt,${height}pt"



#Make the scrolls:
pdfj="pdfjam --nup 1x$numpages --papersize {${papersize}} --outfile"
$pdfj "$TMPFILE1" "$1"
$pdfj "$TMPFILE2" "$2"

diffpdf "$TMPFILE1" "$TMPFILE2"

rm -f "$TMPFILE1" "$TMPFILE2"
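The size arithmetic in the script can also be written as one small function, which makes it easy to check by hand. The following is only a sketch: `scroll_papersize` is a hypothetical name, and it takes the widths, heights and page counts as arguments instead of calling pdfinfo. All comparisons are done on numbers, which matters because a string comparison would rank "9" above "10".

```shell
# Hypothetical helper, not part of the script above.
# usage: scroll_papersize WIDTH1 HEIGHT1 PAGES1 WIDTH2 HEIGHT2 PAGES2 [MAX_HEIGHT]
scroll_papersize() {
    awk -v w1="$1" -v h1="$2" -v p1="$3" \
        -v w2="$4" -v h2="$5" -v p2="$6" -v max="${7:-15840}" '
    BEGIN {
        # The "+ 0" forces numeric comparison in every awk implementation.
        w = (w1 + 0 > w2 + 0) ? w1 : w2    # widest page
        h = (h1 + 0 > h2 + 0) ? h1 : h2    # tallest page
        n = (p1 + 0 > p2 + 0) ? p1 : p2    # larger page count
        total = h * n                      # height of the whole scroll
        if (total > max + 0) total = max   # clamp to the assumed pdfjam limit
        printf "%spt,%spt\n", w, total
    }'
}
```

For example, `scroll_papersize 595 842 9 612 792 10` prints `612pt,8420pt`, the same form of value the script passes to `--papersize`.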
  • 3
    I made your script whitespace-compatible and added unique tempfiles. I hope you don't mind. – Glutanimate May 19 '13 at 16:59
  • 2
    Also fixed a small bug where the script would create an empty text file in the working directory. (remember to always use double brackets with if statements that use ">" and related operands.) – Glutanimate May 19 '13 at 17:10
  • 2
    One last remark: This script will work fine only for DIN A4 sized documents. You will have to adjust the PAGEHEIGHT value to get it to work with smaller documents. I'm sure there's a way to automate this but don't know how atm. – Glutanimate May 19 '13 at 17:21
  • 2
    Thanks for making the improvements @Glutanimate. I've added support for comparison of pdfs of arbitrary and differing sizes (as long as the pages within each pdf are of uniform size, that is). – krumpelstiltskin May 19 '13 at 20:11
  • saved to a gist for convenience https://gist.github.com/timabell/9616807b2fe3fa60f234 – Tim Abell Nov 17 '15 at 17:27
  • Style question - why bc <<< sometimes and echo | bc others? – Greg Bell Jul 09 '16 at 23:01
  • I can't comment on bash programming style. I know very little about bash programming. Sorry. – krumpelstiltskin Jul 12 '16 at 13:09
  • 1
    I have to manually set the page width, or pdf-jam creates a VERY wide pdf, with the text centered but completely out of scale. I could not find why – Emilio Jul 10 '20 at 09:13
  • 1
    diffPDF has been unmaintained for more than 10 years, in favor of the commercial version (which only runs on Windows). Unfortunately, 10 years without being maintained makes the software quite glitchy. Are there any newer, maintained applications to compare PDFs? – chronos00 Dec 13 '23 at 18:39
  • 1
    This is extremely helpful @krumpelstiltskin. Even as of today, 10y later, I'm struggling to find open source tools to achieve the seemingly simple task of comparing pdfs, with DiffPDF + "scrolling" being the best approach apparently. One thing: it seems that DiffPDF struggles to compare long pdfs with this method. I have two +70p pdfs and after page 22 diffpdf already shows them as if they had no difference from that point onwards (with this not being the case). – Daniel Mar 17 '24 at 17:11
  • @Daniel: shoot. I haven't tried it on PDFs with that many pages :( – krumpelstiltskin Mar 18 '24 at 18:21
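One thing that can be checked with simple arithmetic is how many full-size pages fit under MAX_HEIGHT before the script starts clamping the scroll height. Whether that clamp explains the behaviour reported for long PDFs is only a guess. The sketch below uses two hypothetical helpers, `pages_per_scroll` and `chunk_ranges`, to work out page ranges that would each stay under the limit; splitting the PDFs along those ranges is left to whatever tool is at hand.

```shell
# Hypothetical helpers, not part of the script above.

# pages_per_scroll PAGE_HEIGHT [MAX_HEIGHT]: whole pages that fit in one scroll.
pages_per_scroll() {
    awk -v h="$1" -v max="${2:-15840}" 'BEGIN { print int(max / h) }'
}

# chunk_ranges NUMPAGES PER_SCROLL: print one "first-last" range per scroll.
chunk_ranges() {
    local total=$1 per=$2 first=1 last
    while [ "$first" -le "$total" ]; do
        last=$((first + per - 1))
        # The final chunk may be shorter than the others.
        [ "$last" -gt "$total" ] && last=$total
        echo "${first}-${last}"
        first=$((last + 1))
    done
}
```

With A4 pages (842 points tall), `pages_per_scroll 842` gives 18, and `chunk_ranges 70 18` prints `1-18`, `19-36`, `37-54` and `55-70`.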
17

Even though this doesn't solve the issue directly, here is a nice way to do it all from the commandline with few dependencies:

diff <(pdftotext -layout old.pdf /dev/stdout) <(pdftotext -layout new.pdf /dev/stdout)

https://linux.die.net/man/1/pdftotext

It works really well for basic pdf comparisons. If you have a newer version of pdftotext you can try -bbox instead of -layout.
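If you only want to know which pages changed, the same idea can be applied one page at a time with pdftotext's -f and -l options (first and last page to convert). This is a rough sketch, not a polished tool: `page_text` and `changed_pages` are names I made up, the page count has to be supplied by the caller (for example from pdfinfo), and the comparison is page-for-page, so an inserted paragraph that shifts text onto later pages will flag all of those pages.

```shell
# Hypothetical helper: print the text of a single page of a PDF.
# "-" as the output file makes pdftotext write to standard output.
page_text() {
    pdftotext -layout -f "$2" -l "$2" "$1" -
}

# Hypothetical helper: changed_pages OLD.pdf NEW.pdf NUMPAGES
# Prints the number of every page whose extracted text differs.
changed_pages() {
    local old=$1 new=$2 total=$3 page=1
    while [ "$page" -le "$total" ]; do
        if [ "$(page_text "$old" "$page")" != "$(page_text "$new" "$page")" ]; then
            echo "$page"
        fi
        page=$((page + 1))
    done
}
```

A call would look like `changed_pages old.pdf new.pdf 12`.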

As far as diffing programs go, I like using diffuse, so the command changes ever so slightly:

diffuse <(pdftotext -layout old.pdf /dev/stdout) <(pdftotext -layout new.pdf /dev/stdout)

http://diffuse.sourceforge.net/

Hope that helps.

phyatt
5

As a complement to the above answer about diff and diffuse, we can use Meld as a graphical comparison tool. Install it with

sudo apt-get install meld

and then compare documents with a command like

meld <(pdftotext -layout old.pdf /dev/stdout) <(pdftotext -layout new.pdf /dev/stdout)

Personally, I like Meld more than Diffuse or KDiff3.

N0rbert
5

We've been working on a tool and wanted to chime in.

If you're happy with trying an online tool, we built something at Draftable.com which does what you seem to want - compare two PDF/Word files and show deletions and additions.

Right now, our Desktop version is Windows only; but, we also have an API that we published a few years ago and it has been working very well for people with high volumes or security concerns.

I've prepared an image (link below) so that you can see the kind of output you'd get without needing to visit the site. Feedback greatly appreciated!

Sample comparison

Kevin Bowen
4

If you have 2-3 huge PDF files (or EPUB or other formats, see below) to compare, you can combine the power of:

  1. calibre (to convert your source to text)

  2. meld (to visually search for the differences between the text files)

  3. parallel (to use all your system cores to speed up)

The script below accepts any of the following input formats: MOBI, LIT, PRC, EPUB, ODT, HTML, CBR, CBZ, RTF, TXT, PDF and LRS.

If not installed, then install meld, calibre and parallel:

#install packages
sudo apt-get -y install meld calibre parallel

To be able to run the script from anywhere on your computer, save the following code in a file named "diffepub" (with no extension) in the directory "/usr/local/bin".

#!/bin/bash
usage="
*** usage:

diffepub - compare text in two files. Valid format for input files are:
MOBI, LIT, PRC, EPUB, ODT, HTML, CBR, CBZ, RTF, TXT, PDF and LRS.

diffepub -h | FILE1 FILE2 [FILE3]

-h print this message

Example:
diffepub my_file1.pdf my_file2.pdf
diffepub my_file1.epub my_file2.epub

v0.2 (added parallel and 3 files processing)
"

#parse command line options
while getopts "h" OPTIONS ; do
  case ${OPTIONS} in
    h|-help) echo "${usage}"; exit;;
  esac
done
shift $(($OPTIND - 1))

#check if first 2 command line arguments are files
if [ -z "$1" ] || [ -z "$2" ] || [ ! -f "$1" ] || [ ! -f "$2" ]
then
  echo "ERROR: input files do not exist."
  echo
  echo "$usage"
  exit 1
fi



#create temporary files (first & last 10 characters of
# input files w/o extension)
file1=`basename "$1" | sed -r -e '
s/\..*$//                     #strip file extension
s/(^.{1,10}).*(.{10})/\1__\2/ #take first-last 10 chars
s/$/_XXX.txt/                 #add tmp file extension
'`
TMPFILE1=$(mktemp --tmpdir "$file1")

file2=`basename "$2" | sed -r -e '
s/\..*$//                     #strip file extension
s/(^.{1,10}).*(.{10})/\1__\2/ #take first-last 10 chars
s/$/_XXX.txt/                 #add tmp file extension
'`
TMPFILE2=$(mktemp --tmpdir "$file2")

if [ "$#" -gt 2 ] 
then
  file3=`basename "$3" | sed -r -e '
  s/\..*$//                     #strip file extension
  s/(^.{1,10}).*(.{10})/\1__\2/ #take first-last 10 chars
  s/$/_XXX.txt/                 #add tmp file extension
  '`
  TMPFILE3=$(mktemp --tmpdir "$file3")
fi

#convert to txt and compare using meld
doit(){ #to solve __space__ between filenames and parallel
  ebook-convert $1
}
export -f doit
if [ "$#" -gt 2 ] 
then
  (parallel doit ::: "$1 $TMPFILE1" \
                     "$2 $TMPFILE2" \
                     "$3 $TMPFILE3" ) &&
  (meld "$TMPFILE1" "$TMPFILE2" "$TMPFILE3")
else
  (parallel doit ::: "$1 $TMPFILE1" \
                     "$2 $TMPFILE2" ) &&
  (meld "$TMPFILE1" "$TMPFILE2")
fi

Make sure the owner is your user and it has execution permissions:

sudo chown $USER:$USER /usr/local/bin/diffepub
sudo chmod 700 /usr/local/bin/diffepub

To test it, just type:

diffepub FILE1 FILE2

I tested it by comparing 2 revisions of a PDF with more than 1600 pages, and it works perfectly. It took 10 minutes to convert both files to text, which I attribute to calibre being written in Python for portability. Slow, but reliable.
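The `doit` wrapper relies on word splitting, so it will break if a file name itself contains a space. As a possible alternative, here is a sketch that swaps GNU parallel for plain shell background jobs and keeps every file name quoted. `convert_all` and the `CONVERTER` variable are hypothetical names introduced only for this example; by default the function calls `ebook-convert INPUT OUTPUT` just as the script does.

```shell
# Hypothetical helper: convert_all SRC1 DST1 [SRC2 DST2 ...]
# Starts one conversion per (source, destination) pair in the background,
# then waits for all of them. Returns non-zero if any conversion failed.
convert_all() {
    local status=0 pid pids=""
    while [ "$#" -ge 2 ]; do
        # Arguments stay quoted, so spaces in file names are preserved.
        "${CONVERTER:-ebook-convert}" "$1" "$2" &
        pids="$pids $!"
        shift 2
    done
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}
```

In the script it could be used as `convert_all "$1" "$TMPFILE1" "$2" "$TMPFILE2" && meld "$TMPFILE1" "$TMPFILE2"`. Unlike parallel, this starts all conversions at once instead of limiting them to the number of cores, which should be fine for two or three files.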

luis_js
0

Following up on krumpelstiltskin's answer: the script did not run on my Mac, so here is a version that runs on macOS.

#!/bin/bash
#
# 2023, updated to run on MacOS Ventura v13
#
# Compare two PDF files.
# Dependencies:
#  - pdfinfo (xpdf)
#  - pdfjam  (texlive-extra-utils)
#  - diffpdf
#

MAX_HEIGHT=15840 #The maximum height of a page (in points), limited by pdfjam.

#TMPFILE1=$(mktemp /tmp/XXXXXX.pdf)
#TMPFILE2=$(mktemp /tmp/XXXXXX.pdf)
#mktemp /tmp/XXXXXX.pdf failed on MacOS
TMPFILE1=$(mktemp)
TMPFILE2=$(mktemp)

usage="usage: scrolldiff -h FILE1.pdf FILE2.pdf
  -h print this message

v0.0"

while getopts "h" OPTIONS ; do
    case ${OPTIONS} in
        h|-help) echo "${usage}"; exit;;
    esac
done
shift $(($OPTIND - 1))

if [ -z "$1" ] || [ -z "$2" ] || [ ! -f "$1" ] || [ ! -f "$2" ]
then
  echo "ERROR: input files do not exist."
  echo
  echo "$usage"
  exit 1
fi

#Get the number of pages:
pages1=$( pdfinfo "$1" | grep 'Pages' - | awk '{print $2}' )
pages2=$( pdfinfo "$2" | grep 'Pages' - | awk '{print $2}' )
numpages=$pages2
#Compare as integers; [[ a > b ]] would compare the values as strings.
if (( pages1 > pages2 ))
then
  numpages=$pages1
fi

#Get the paper size:
width1=$( pdfinfo "$1" | grep 'Page size' | awk '{print $3}' )
height1=$( pdfinfo "$1" | grep 'Page size' | awk '{print $5}' )
width2=$( pdfinfo "$2" | grep 'Page size' | awk '{print $3}' )
height2=$( pdfinfo "$2" | grep 'Page size' | awk '{print $5}' )

if [ $(bc <<< "$width1 < $width2") -eq 1 ]
then
  width1=$width2
fi
if [ $(bc <<< "$height1 < $height2") -eq 1 ]
then
  height1=$height2
fi

height=$( echo "scale=2; $height1 * $numpages" | bc )
if [ $(bc <<< "$MAX_HEIGHT < $height") -eq 1 ]
then
  height=$MAX_HEIGHT
fi
papersize="${width1}pt,${height}pt"

#Make the scrolls:
pdfj="pdfjam --nup 1x$numpages --papersize {${papersize}} --outfile"
$pdfj "$TMPFILE1" "$1"
$pdfj "$TMPFILE2" "$2"

#diffpdf "$TMPFILE1" "$TMPFILE2"
#diff-pdf on MacOS
diff-pdf --output-diff diff.pdf "$TMPFILE1" "$TMPFILE2"

rm -f "$TMPFILE1" "$TMPFILE2"

link to github gist

Dietrich