[Tex/LaTex] How to track changes between PDFs

changes, pdf, tools

I've got a modestly large document (60 pages or so), of technical documentation, in two versions (one of them improved). Unfortunately, I seem to have lost the source of the older version (but I do have the code for the most up-to-date document).

I'd like to compare the two versions of the document (in PDF format) and see what got changed between the two, preferably in a diff-like format.

Not a whole lot changed between the two versions, but minor details are important and could be easily overlooked.

Suggestions on how to do this efficiently and effectively will be deeply appreciated. Non-TeX related solutions are also welcome, including non-free ones.

I should mention that the documents contain almost exclusively text, with very few figures and tables, and a modest amount of formulae.

Edit:
The documents were created with pdfLaTeX by myself, and if you think there's relevant information in the packages I used, I can provide that as well.

Edit 2: A lot of good and promising answers! I will make an effort to try them all one of these days, and will post an update here with the result. This might make a good community wiki.

Best Answer

I've made a quick comparison between some of the methods suggested here. This answer will be a community wiki, and I will make this the accepted one. Hopefully nobody will mind the loss of 15 rep. I upvoted all useful answers, and I urge you to do the same -- all of them were ideas with potential.

Adobe Acrobat Professional -- Compare files

I've used Acrobat 8 for this test (this is probably the first time I thank the Powers-That-Be that my organization is a mostly-Windows shop). Setting this up is easy enough -- go to Advanced->Compare documents..., and fill in the blanks. That was easy, I thought. Well...

Pro tip -- Don't choose Detailed analysis (slow). You'll thank me later. I did the first time around, and this thing had been running for 10 minutes, and let's just say that Acrobat is now the proud owner of the World Prize of Ridiculous Memory Consumption, with a whopping 750 (seven hundred and fifty, not a typo) MB. This only reminded me there's a reason I don't use Acrobat except for proof-reading just before printing.

I tried side-by-side report with Normal analysis, and that could have worked well if only my revised version didn't have 8 pages or so more than the original. It didn't recognize that the same text is somewhere down in the document -- as far as I can tell, it just put the two documents side by side, more or less, with some fancy useless colouring. Oh dear. I could have done that myself without whipping out Liberia's deficit for a license. At this point I wasn't inclined to try again with the detailed analysis, which supposedly would detect such things.

I'd give it a -1/10, for having a useless option that doesn't really work, and spectacularly so.

Update: Geoffrey had a different experience with Acrobat on what seems to be a similar document, and I tried to repeat what he did. On my first try I had used the Page by page comparison option, which was so unsatisfactory; this time I tried Textual differences. This works as he suggests, although the diff result is kind of useless if one chooses the Consolidated report option, and still difficult to interpret with the Side by side report. It still does not show the matching text portions of each document side by side, as one accustomed to the diff format would expect; instead it highlights the common text and the text unique to each version differently. At least there's a comment on which page of the other document you'd find the matching text, so that's useful, although not exactly user-friendly. Also, I noticed it got confused in some places, matching seemingly random words and word fragments.

This significantly improves Acrobat's score, and I think a 7/10 is appropriate, with two whole points deducted for the non-trivial license fee -- the formula I use is:

licensePenalty = max(0, len(str(licenseFeeInUSD))-1).
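
For the curious, the penalty formula runs as-is in Python. A minimal sketch follows; the fee of 449 USD is a hypothetical figure, picked only because it has three digits and so matches the two-point deduction. It is not the actual price.

```python
def license_penalty(license_fee_usd: int) -> int:
    """Points deducted: one for every digit of the fee beyond the first."""
    return max(0, len(str(license_fee_usd)) - 1)


# Hypothetical three-digit fee (an assumption, not the real price).
print(license_penalty(449))  # 2
# Free tools are not penalised.
print(license_penalty(0))    # 0
```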

Otherwise, it works well, and the performance is also similar to what Geoffrey observed.

Adobe Acrobat Professional -- Export in .txt format + diff

This kind of works. You'd get most of the text right, if it weren't for the annoying mangling of ligatured glyphs. Also, my text is in Swedish, and has a decent number of diacritics, which also got lost. Hyphenation sneaked through as well (rather annoying -- Swedes have very long words sometimes). The formatting is abysmal, but could probably be fixed with an intelligent $FAVOURITE_INTERPRETED_LANGUAGE script.
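
As an illustration of the kind of clean-up script meant here, the sketch below rejoins words that were hyphenated across a line break. It is only a sketch under a simplifying assumption: every hyphen at the end of a line is treated as a typesetting hyphen, so a genuine hyphenated compound that happens to break at its hyphen will lose it. The function name `dehyphenate` is made up for this example.

```python
import re


def dehyphenate(text: str) -> str:
    """Rejoin words that were split by a hyphen at the end of a line."""
    # Word character, hyphen, newline, word character: drop the hyphen and
    # the newline. \w is Unicode-aware in Python 3, so letters such as
    # Swedish å, ä and ö are matched as well.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


print(dehyphenate("tek-\nnisk dokumentation"))  # teknisk dokumentation
```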

I'd give a 6/10 for the effort, but only because diffing actually works. The text is not quite readable in places; for example, words like träff look like tr".

diffpdf

I found this little gem when looking for diff-like programs to install on Ubuntu. Available from http://www.qtrac.eu/diffpdf.html. Needs Qt and Poppler.

This is actually superior to Acrobat's Normal analysis mode -- the differences are nicely highlighted and obvious. It does page-by-page comparison like Acrobat, and there seems to be a way to make better comparisons if you know where you've inserted additional pages, but that wasn't very straightforward to do, and I couldn't be bothered to look through the document to find which pages exactly were added -- that's kind of the point of using a tool to do it, no?

I'd give it a 4/10 for this particular problem, although for others with small changes it will work great, and would deserve an 8/10 (the user interface could be a bit confusing).

pdftotext+diff

For those that don't know, pdftotext is part of the xpdf collection, available from here: http://www.foolabs.com/xpdf/home.html. I used the Linux version on Ubuntu.

This works better than the Acrobat .txt export (method 2 above), but ligatured glyphs are substituted with what look like the UTF-8 symbols representing them, like ﬁ, ﬀ, ﬄ, etc. Quotes got mangled as well, again replaced with a UTF-8 symbol (when writing, I always use the "proper" TeX `` and '' quotes). Text search works perfectly, though, even when using such combinations. The readability is much better, if your favourite text editor understands and renders UTF-8, and the formatting is improved, albeit slightly. Hyphenation, however, got taken care of, which is quite nice. One annoying thing, though -- headers and footers, together with page numbers, find their way into the text document, which could be frustrating when comparing the versions.
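
One way to undo the ligature and quote substitutions before diffing is Unicode compatibility normalisation. The sketch below is my own illustration, not something I ran as part of the comparison: NFKC turns single-codepoint ligatures such as U+FB01 into plain letters, and a small table (an assumption about which quote characters show up) maps curly quotes to ASCII. Note that NFKC also folds other compatibility characters, for example superscript digits, which is usually harmless for a diff but does change the text.

```python
import unicodedata

# Curly quotes mapped to their ASCII counterparts (illustrative choice).
QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def normalize_text(text: str) -> str:
    """Expand ligature characters and straighten quotes."""
    # NFKC replaces compatibility characters (e.g. the ff ligature U+FB00)
    # with their plain equivalents and keeps composed letters such as ä.
    text = unicodedata.normalize("NFKC", text)
    return text.translate(QUOTES)


print(normalize_text("tr\u00e4\ufb00"))  # träff
```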

This one deserves a solid 9/10, with points deducted for the mangled UTF-8 symbols, and for the header and footer issue during conversion (the latter creates a lot of "false positives" for diff).
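
The header and footer noise can be reduced by filtering the extracted lines before diffing. The following is a rough sketch, assuming `pdftotext` is on the PATH, that page numbers sit alone on a line, and that the running header can be described by a regular expression you supply. The helper names and the file names in the usage comment are hypothetical. Be aware that the filter also drops any legitimate line consisting only of digits.

```python
import difflib
import re
import subprocess


def pdf_to_lines(pdf_path):
    """Run pdftotext and return the extracted text as a list of lines."""
    # "-" as the output file sends the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True, check=True,
    )
    return result.stdout.decode("utf-8").splitlines()


def drop_page_furniture(lines, header_pattern=None):
    """Drop empty lines, bare page numbers and lines matching header_pattern."""
    kept = []
    for line in lines:
        stripped = line.strip()  # also removes form feeds at page breaks
        if not stripped or re.fullmatch(r"\d+", stripped):
            continue
        if header_pattern and re.fullmatch(header_pattern, stripped):
            continue
        kept.append(stripped)
    return kept


def diff_lines(old, new):
    """Return a unified diff of two lists of lines."""
    return list(difflib.unified_diff(old, new, "old.pdf", "new.pdf", lineterm=""))


# Usage (hypothetical file names and header text):
# old = drop_page_furniture(pdf_to_lines("old.pdf"), r"Manual \d\.\d")
# new = drop_page_furniture(pdf_to_lines("new.pdf"), r"Manual \d\.\d")
# print("\n".join(diff_lines(old, new)))
```

Because the filtered text is compared as a flat list of lines, pages inserted in the newer version show up as ordinary additions instead of throwing off a page-by-page alignment.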

Related Solutions

[Tex/LaTex] How to use LaTeX from Python.

Recently I've written a library exactly for this purpose. It supports tables, plots, matrices and more. https://github.com/JelteF/PyLaTeX

[Tex/LaTex] How to create small PDF files for the Internet

There are a number of tricks for getting optimized pdfs. Many of them are implemented in the tool pdfsizeopt. With some patches (posted in the pdfsizeopt bugtracker) this tool can run on all my tex-generated pdfs (and nearly all of the non-tex-generated ones). I use the commandline:

python ./pdfsizeopt.py --use-pngout=true --use-jbig2=true --use-multivalent=true --do-unify-fonts=false filetocompress.pdf

I use --do-unify-fonts=false even though it produces slightly larger pdfs, because of a bug where a few glyphs are not displayed with certain pdf viewers (windows adobe reader, for example).

There are indeed various things you can do during document production with tex, to make sure that the compressed pdf ends up as small as possible: several of these are discussed in the EuroTeX 2009 White paper about pdfsizeopt (available at https://github.com/pts/pdfsizeopt/releases/download/docs-v1/pts_pdfsizeopt2009.psom.pdf).

As regards fonts, pdfsizeopt will recode fonts to the very compressed CFF format, and take care of subsetting and duplication issues. I haven't investigated deeply, but in my tests it seems that of the 2 options for type 1 encoded T1 (multilingual) tex fonts, the Latin Modern fonts generally produce significantly larger PDFs than the CM-Super version (which is unfortunate, because Latin Modern is superior in just about every other way (see this question)). I just did a quick experiment and this difference in size seems to be only for the pre-pdfsizeopt pdfs: after pdfsizeopt, Latin Modern is the same or smaller than CM-Super.

Using fonts that don't have optical scaling will indeed produce a smaller PDF, but I don't recommend it because if you are using multiple sizes then the non-optically scaled fonts will look much worse.
