I am pleased to be able to teach Martin Scharrer something he didn't know :)
A fully expandable sanitizer
The following is an implementation of a \Sanitize command that:
Completely removes all control sequences and balanced braces in its argument.
Does not choke on nested braces.
Keeps spaces where they were requested either by " " or "\ " (for after macros).
Is fully expandable (i.e. can be put in \edef or \csname).
Edit: This is a revised version. My initial code had a few minor bugs that were a major pain to fix, and this is substantially rewritten. I think it's clearer, too.
How it works
There are three states: sanitizing spaces, sanitizing groups, and sanitizing tokens. We scan for "words" one at a time, then within each "word" look for groups that might be hiding spaces (TeX's macro scanner will only absorb delimited arguments with balanced braces). Finally, once we are satisfied that we are looking at genuinely contiguous tokens, we scan one at a time and throw out the ones that are control sequences, leaving only explicitly specified spaces (" " or "\ ").
From the inside out, the operation looks like this:
\SanitizeTokens is a big nested conditional that tests its argument against the various special cases. During the sweep for spaces, all space characters were converted to \SanitizedSpace tokens, and they are now converted to \RealSpace tokens. Both \SanitizedSpace and \SanitizeStop are macros that expand to themselves, and since they are private, testing against them via \ifx is a reliable way to detect exactly these control sequences (in the first version, these were \countdef tokens, which have the same property but are not quite as private).
\SanitizeGroups uses the tricky \def\SanitizeGroups#1#{ construction discussed in this question: Macros with # as the last parameter. It is the most legitimate such use I can imagine: its point is to detect groups, which you can't do using plain macro expansion in any other way. It guarantees that #1 has no groups in it, and since this comes after space elimination, it also has no spaces in it, so we can run \SanitizeTokens straight away. We then "enter" the group and go back to eliminating spaces.
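As a standalone illustration of that construction (the macro name \UpToGroup is made up for this sketch), a parameter text ending in # makes TeX absorb a delimited argument up to the next explicit opening brace, which itself stays in the input stream:

```latex
% #1 is delimited by the next explicit `{`; the brace is NOT consumed.
\def\UpToGroup#1#{[before the group: #1]}
% `\UpToGroup abc{def}` expands to `[before the group: abc]{def}`,
% so the group is detected without ever absorbing it.
```

This is exactly what lets \SanitizeGroups hand the group-free prefix to \SanitizeTokens and then deal with the group separately.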
\SanitizeSpaces uses pattern matching to grab the first chunk of text up until a space, excluding of course those spaces that are in groups. There is a technical trick here: every use of this macro has {} right after it, before the text. The point of that is so that the argument scanner doesn't remove braces around a group constituting an entire "word" between spaces. If that happened, we would erroneously treat the word as though it had been cleared of spaces when, in fact, it has not. (Any unsanitized spaces would be eaten by \SanitizeTokens, because argument scanning ignores spaces.)
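The brace-stripping behaviour that the leading {} guards against can be seen with a toy macro (names here are made up):

```latex
% Toy version of the space scanner: #1 is delimited by the first space.
\def\FirstChunk#1 #2\Stop{[#1]}
% TeX strips braces when a delimited argument is exactly one group:
%   \FirstChunk {Word1 Word2} rest\Stop   gives #1 = Word1 Word2
%   (the space inside the word would now slip past the space sweep)
% With the protective {} in front, the braces survive:
%   \FirstChunk{}{Word1 Word2} rest\Stop  gives #1 = {}{Word1 Word2}
```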
There are of course some cute utility macros. My favorite is \IfNoGapToStop, which is called like this: \IfNoGapToStop.X. \SanitizeStop, with X being the quantity potentially containing a gap. If it has none, then the first gap is the visible space after the period; if it has a gap, then the two periods are in different components, and both arguments of \IfNoGapToStop are nonempty.
Aside from the structural changes from the previous version, this one correctly preserves spaces at the boundaries of groups. (That version didn't explicitly scan for groups, but eliminated them as a side effect of absorbing tokens. That works, but it also makes it impossible to be sure when you are looking at a group, which may have spaces, rather than a single token.)
Oh, and of course: the algorithm is no longer stupid. The last version rescanned the entire initial portion of the text repeatedly while looking for words (the point of that was so as not to "lose" those tokens before sanitizing them). Now I crawl through the words one at a time, so there's no problem with abandoning each one when looking for the next. That turns a quadratic algorithm into a linear one.
This is not my preferred way of writing TeX anymore (for that, you should read this answer: How to write readable commands), but pgfkeys is really not the tool for this kind of textual parsing.
\documentclass{article}
\makeatletter
\newcommand\Sanitize[1]{%
\SanitizeSpaces{}#1 \SanitizeStop
}
% This loops through and replaces all spaces (outside brace groups) with \SanitizedSpace's.
% Then it goes for the control sequences.
% All calls to this should put a {} right before the content, to inhibit the gobbling of braces
% if there is a group right at the beginning.
\def\SanitizeSpaces#1 #2\SanitizeStop{%
\IfEmpty{#2}% Last word
{\IfEmpty{#1}% No content at all
{}% Nothing to do
{\SanitizeGroups#1{\SanitizeStop}}%
}%
% No need for a trailing space anymore: there's already one from the initial call
{\SanitizeGroups#1\SanitizedSpace{\SanitizeStop}\SanitizeSpaces{}#2\SanitizeStop}%
}
% Sanitize tokens up to the next group, then go back to doing spaces.
\def\SanitizeGroups#1#{%
\SanitizeTokens#1\SanitizeStop
\EnterGroup
}
% Sanitize the next group from the top.
\newcommand\EnterGroup[1]{%
\ifx\SanitizeStop#1%
\expandafter\@gobble
\else
\expandafter\@firstofone
\fi
{\SanitizeSpaces{}#1 \SanitizeStop\SanitizeGroups}%
}
\newcommand\SanitizeTokens[1]{%
\ifx\SanitizeStop#1%
\else
\ifx\SanitizedSpace#1%
\RealSpace
\else
\ifx\ #1%
\RealSpace
\else
\if\relax\noexpand#1%
\else
#1%
\fi
\fi
\fi
\expandafter\SanitizeTokens
\fi
}
% We use TeX's proclivity to eat braces even for delimited arguments to eat the braces if #1
% happens to be just {}, which we put in.
% Even if we didn't put it in, {} is going to get thrown out when \SanitizeSpaces gets to it.
\newcommand\IfEmpty[1]{%
\IfOneTokenToStop.#1\SanitizeStop
{% #1 has at most space tokens
% and thus is nonempty if and only if there is a gap:
\IfNoGapToStop.#1. \SanitizeStop
}
{% #1 has non-space tokens
\@secondoftwo
}%
}
% Checks for a gap in #1, meaning #2 is nonempty
% This should only be used with \IfEmpty
\def\IfNoGapToStop#1 #2\SanitizeStop{%
% It's enough to check for one token, since #2 is never just spaces
\IfOneTokenToStop.#2\SanitizeStop
}
\def\IfOneTokenToStop#1#2{% From \IfEmpty, #1 is always a .
\ifx\SanitizeStop#2%
% If #2 is multi-token, the rest of it will fall in the one-token case and be passed over.
% If not, well, that's what we asked for.
\expandafter\@firstoftwo
\else
\expandafter\GobbleToStopAndSecond
\fi
}
\def\GobbleToStopAndSecond#1\SanitizeStop{%
\@secondoftwo
}
\makeatother
\def\SanitizeStop{\SanitizeStop}
\def\SanitizedSpace{\SanitizedSpace}
\def\RealSpace{ }
\begin{document}
\setlength\parindent{0pt}\tt
% Torture test
\edef\a{%
\Sanitize{ Word1 \macro{Word2 Word3}{\macro\ Word4}{ Word5} {Word6 }{}Word7{ }{{Word8}} }
}\meaning\a
\a
\medskip
% Examples
\edef\a{%
\Sanitize{\emph{This} sentence has \TeX\ macros and {grouping}. }
}\meaning\a
\a
\medskip
\edef\a{%
\Sanitize{{A}{ gratuitously {nested} sentence {}{{with many} layers}}.}
}\meaning\a
\a
\medskip
\end{document}
You can use a standard \renewcommand to modify the text:
\documentclass{article}
\newcommand{\someSpecialText}{A text that can be changed locally\ldots}
\newcommand{\testIt}{The value of \texttt{someSpecialText} is : ``\someSpecialText''.}
\begin{document}
% The default behavior.
\testIt
% Here I would like to change the definition of `\someSpecialText`
% or anywhere else in the document.
{% Start of group
\renewcommand{\someSpecialText}{A text that \textit{was} changed locally\ldots}
\testIt
}% End of group
% The default behavior.
\testIt
\end{document}
In the above example, the \renewcommand was placed inside a group, delimited by the braces { and }, to localize the redefinition. If this behaviour is not desired, simply remove the braces to make the redefinition global from that point forward.
Of course, one can also write this in a macro form, which could be considered "shorter", with some default value. Here is an example:
\documentclass{article}
\newcommand{\testIt}[1][A text that can be changed locally\ldots]{The value passed to \texttt{testIt} is : ``#1''.}
\begin{document}
% The default behavior.
\testIt
% Here I would like to change the definition of `\someSpecialText`
% or anywhere else in the document.
\testIt[A text that \textit{was} changed locally\ldots]
% The default behavior.
\testIt
\end{document}
\testIt is set up to output something with a default value. However, you can supply an optional argument that modifies the default behaviour.
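For comparison (a sketch, assuming a LaTeX release with \NewDocumentCommand in the kernel, or the xparse package loaded), the same default can be declared with an O-type optional argument:

```latex
% O{...} gives an optional argument with the stated default value.
\NewDocumentCommand{\testIt}{O{A text that can be changed locally\ldots}}
  {The value passed to \texttt{testIt} is : ``#1''.}
```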
\pdfmdfivesum also works on arbitrary strings. The resulting hex string can be decoded to save space.
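For instance, a minimal pdfLaTeX sketch that stores the MD5 sum of a string in a macro (this relies only on the primitive itself):

```latex
% Compile with pdflatex; \pdfmdfivesum is a pdfTeX primitive.
\documentclass{article}
\begin{document}
% Fully expandable, so it is safe inside \edef:
\edef\myhash{\pdfmdfivesum{Hello world}}
MD5: \texttt{\myhash}
% With the keyword `file', the argument names a file to hash:
% \edef\filehash{\pdfmdfivesum file {\jobname.tex}}
\end{document}
```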
\pdfmdfivesum is expandable and can be used inside \edef. \pdfmdfivesum works on a file only when the keyword file is given.

Package pdftexcmds
\pdfmdfivesum is available in pdfTeX in both DVI and PDF modes. Package pdftexcmds defines missing pdfTeX primitives in LuaTeX. The package also works in plain TeX (\input pdftexcmds.sty). Its command names use the prefix \pdf@ instead of \pdf:

XeTeX
\pdfmdfivesum was added around version 0.99993, taken from pdfTeX, and was later renamed to \mdfivesum. Thus, the current version (3.14159265-2.6-0.99996) calculates MD5 sums via \mdfivesum. The keyword file is also supported, as in pdfTeX.