[Tex/LaTex] Reproducible LaTeX builds – compile to a file which always hashes to the same value

compiling, pdf, security

I am interested in using LaTeX in a way such that when I compile twice, I get the very same resulting file.

My test.tex:

\documentclass{article}
\begin{document}
Hello, World!
\end{document}

Compiling yields different hashes:

8493b40b225993d01d46ed7479725d8b4e9f6efbfddcc8d6d657f00084d41cdb  test.pdf
05f2a3cd3780df33470a4363da18b008595e42acd9a085d76c83b6c83dc71c41  test.pdf

and so on. (This also applies when compiling to DVI at least a minute apart.)

My guess is that this is, at least, due to the Created and Modified metadata of the PDF. I followed this answer for fixing those, but I still get different hashes.

I found that compiling with faketime '2008-12-24 08:15:42' pdflatex test makes the file reproducible. I conclude that no random data is involved and that the output depends only on the time.

My question is thus: can I influence that time for pdflatex from within my TeX document?

Best Answer

Since TeX Live 2016, there are a couple of options to achieve reproducible builds:

pdfTeX

For pdfTeX (version ≥1.40.17), there are three new primitives:

  • \pdfinfoomitdate, which removes the /CreationDate and /ModDate entries from the document info dictionary. By default, these entries are set to the date and time at which the document was compiled. They could already be modified in older versions of pdfTeX using \pdfinfo{/CreationDate (...) /ModDate (...)}, but with \pdfinfoomitdate=1, they are removed completely from the resulting PDF file.
  • \pdftrailerid, which sets the file identifier of the PDF document, stored in the /ID entry of the file trailer as described in the PDF specification, section 10.3. By default, it is computed by hashing the current date and time (even if \pdfinfoomitdate=1 is used) and the full path of the output file. If you include \pdftrailerid{string} with a fixed string in your document, the hash of this string is used as the identifier instead. Leaving the argument empty, as in \pdftrailerid{}, removes the /ID entry completely.
  • \pdfsuppressptexinfo controls some additional metadata written to the document: firstly, pdfTeX usually creates an entry PTEX.Fullbanner containing the full version string as seen in the output of pdftex --version. Furthermore, for every PDF image included in your document, some additional metadata is written. Suppressing these entries is not strictly necessary for reproducible builds, but might help if you want to compile the same document on different systems. It can be done by issuing \pdfsuppressptexinfo=-1.

TL;DR: The easiest way to get reproducible PDF output is to use

\documentclass{article}
\pdfinfoomitdate=1
\pdftrailerid{}
\begin{document}
Hello, World!
\end{document}

 

LuaTeX

Since TeX Live 2017, LuaTeX (version ≥1.0.4) also supports these features, albeit with a slightly different syntax:

  • \pdfvariable suppressoptionalinfo prevents certain metadata from being included in the resulting PDF file, similar to \pdfsuppressptexinfo in pdfTeX, but with more options:

    \pdfvariable suppressoptionalinfo \numexpr
            0
        +   1   % PTEX.FullBanner
        +   2   % PTEX.FileName
        +   4   % PTEX.PageNumber
        +   8   % PTEX.InfoDict
        +  16   % Creator
        +  32   % CreationDate
        +  64   % ModDate
        + 128   % Producer
        + 256   % Trapped
        + 512   % ID
    \relax
    
  • \pdfvariable trailerid lets you specify your own file identifier like \pdftrailerid does, but you have to supply the value as a raw PDF array of two PDF strings yourself, so I recommend simply suppressing the ID using the above command instead.
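
The flag values in the list above are powers of two, so any combination is just their sum. As a small sketch of that arithmetic (the dictionary and the helper name suppress_mask are made up for illustration and are not part of LuaTeX), the following computes the value for a chosen set of entries and prints a line that could be pasted into a preamble:

```python
# Flag values copied from the \pdfvariable suppressoptionalinfo list above.
SUPPRESS_FLAGS = {
    "PTEX.FullBanner": 1,
    "PTEX.FileName": 2,
    "PTEX.PageNumber": 4,
    "PTEX.InfoDict": 8,
    "Creator": 16,
    "CreationDate": 32,
    "ModDate": 64,
    "Producer": 128,
    "Trapped": 256,
    "ID": 512,
}


def suppress_mask(*names):
    """Combine the named flags into one integer (hypothetical helper)."""
    mask = 0
    for name in names:
        mask |= SUPPRESS_FLAGS[name]
    return mask


# The date entries and the file identifier are the parts that change
# between otherwise identical runs.
mask = suppress_mask("CreationDate", "ModDate", "ID")
print(rf"\pdfvariable suppressoptionalinfo {mask}\relax")
```

The resulting number is equivalent to writing the sum out with \numexpr; the explicit sum is arguably easier to read in a document, so this is mainly useful for checking which entries a given number suppresses.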

TL;DR: For reproducible builds in LuaLaTeX, use

\documentclass{article}
\pdfvariable suppressoptionalinfo \numexpr32+64+512\relax
\begin{document}
Hello, World!
\end{document}

 

XeTeX

Since TeX Live 2019, XeTeX supports specifying the file identifier:

  • pdf:trailerid is a \special command recognized by dvipdfmx, which XeTeX uses to produce PDF files. The value format is the same as for \pdfvariable trailerid in LuaTeX: a raw PDF array of two PDF strings. Both strings must be 16 bytes long. The dvipdfmx documentation gives an example with literal strings (written between parentheses). Another option is a 16-byte hex string (written between angle brackets <…>), which could be an MD5 hash identifying the document:

    \special{pdf:trailerid [
        <00112233445566778899aabbccddeeff>
        <00112233445566778899aabbccddeeff>
    ]}
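
One way to obtain such a 16-byte hex string is to hash a fixed label with MD5, which always yields 32 hex digits. The following is only a sketch: the function name trailer_id_special and the label "my-document-v1" are placeholders chosen for illustration, and using the same value for both array elements simply mirrors the example above.

```python
import hashlib


def trailer_id_special(label: str) -> str:
    """Build a pdf:trailerid special from a fixed label (hypothetical helper).

    MD5 produces exactly 16 bytes, i.e. 32 hex digits, which matches the
    length required for each of the two strings.
    """
    digest = hashlib.md5(label.encode("utf-8")).hexdigest()
    # Use the same value for both array elements, as in the example above.
    return "\\special{pdf:trailerid [\n    <%s>\n    <%s>\n]}" % (digest, digest)


# "my-document-v1" is an arbitrary placeholder; any fixed string works.
print(trailer_id_special("my-document-v1"))
```

As long as the label stays the same, the printed \special is identical on every run, so the identifier no longer depends on the time or the output path.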
    

 

All major engines (pdfTeX, LuaTeX, XeTeX)

As an alternative, pdfTeX, LuaTeX and XeTeX support SOURCE_DATE_EPOCH:

If you set the SOURCE_DATE_EPOCH environment variable to a certain date (in the form of a Unix timestamp, as produced e.g. by date +%s), this date is used instead of the current date. Setting it to a fixed value therefore lets you create reproducible PDF files without any changes to the LaTeX source code. Keep in mind, however, that the output file name (for pdfTeX and LuaTeX including the full path, for XeTeX only the name itself) is still used to compute the file identifier described above, so if you move or rename your LaTeX document and recompile, the resulting PDF document will change.
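
To check this in practice, a build script can pin SOURCE_DATE_EPOCH, compile twice in the same directory, and compare the hashes. The sketch below assumes that the engine is on the PATH and that the date from the question is meant as UTC; all function names are made up for illustration.

```python
import hashlib
import os
import subprocess
from datetime import datetime, timezone


def epoch_from_utc(year, month, day, hour=0, minute=0, second=0):
    """Return the Unix timestamp of a UTC date and time."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


def reproducible_env(epoch):
    """Copy the current environment and pin SOURCE_DATE_EPOCH in it."""
    env = dict(os.environ)
    env["SOURCE_DATE_EPOCH"] = str(epoch)
    return env


def sha256_of(path):
    """Hash a file so that two builds can be compared."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def build_twice(tex_file="test.tex", engine="pdflatex", epoch=0):
    """Compile twice with the same timestamp and return both PDF hashes."""
    # The engine writes the PDF into the current directory.
    pdf_file = os.path.splitext(os.path.basename(tex_file))[0] + ".pdf"
    hashes = []
    for _ in range(2):
        subprocess.run(
            [engine, "-interaction=nonstopmode", tex_file],
            env=reproducible_env(epoch),
            check=True,
            stdout=subprocess.DEVNULL,
        )
        hashes.append(sha256_of(pdf_file))
    return hashes


# The date from the question, interpreted as UTC (an assumption).
stamp = epoch_from_utc(2008, 12, 24, 8, 15, 42)
print("SOURCE_DATE_EPOCH=%d" % stamp)
```

Calling build_twice(epoch=stamp) in a directory that contains test.tex should then return two equal hashes on an engine version that honours the variable. Because the output path still feeds into the file identifier, both runs have to write to the same location.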
