[TeX/LaTeX] Could someone further elucidate expansion, catcodes, and scantokens…

e-tex expansion

In response to my question "With TikZ is it possible to pass the node content through a preprocessor?", @MarkWibrow suggested a solution using \scantokens.

{\catcode`\_=13 \gdef_{\rule[-1pt]{0.75em}{1.0pt}}}
\def\pp#1{{\catcode`\_=13 \scantokens{#1\ignorespaces}}}

I've never understood why \scantokens would be useful or not, and I've seen plenty of warnings about the reputed dangers of \scantokens. The documentation in e-TeX is rather sparse (at least, too sparse to enlighten me at all). So, I decided to experiment a bit on my own and wrote my own version of \pp:

\def\aepp#1{{\catcode`\_=13 #1}}

But

 \aepp{hello_world}

results in an error

! Missing $ inserted.
<inserted text> 
                $

I thought, initially, that I perhaps didn't understand something about how \catcode works and tried the following MWE:

\documentclass{article}
{\catcode`\_=13 \gdef_{\rule[-1pt]{0.75em}{1.0pt}}}
\setlength\parindent{0pt}
\begin{document}
{\catcode`\_=13 hello_world}
\end{document}

which compiles without error.

I find this a bit confusing, because isn't

{\catcode`\_=13 hello_world}

what \aepp{hello_world} expands to?

So, while I'm quite a bit in the dark about \scantokens, I'm more curious about why expansion of \aepp{hello_world} failed to accomplish what \pp{hello_world} does.

Best Answer

TeX's scanner (its "eyes") converts the characters in a file into tokens. That happens only once: macro replacement texts and all expansion processing operate on tokens (character tokens are [character-code,catcode] pairs), not on characters.

The catcode table affects only the conversion of characters to character tokens.

So your definition locally sets the catcode of _ to 13, and if a _ character were read from the file within that scope it would be tokenized as a (_,13) token. But you pass the macro a token list as #1 that has already been tokenized, so the catcode table is never consulted and your change to it has no effect. The _ in the argument is still a (_,8) subscript token, hence the "Missing $ inserted" error.
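As a small illustration of that point, here is a minimal sketch in which a token list is stored while _ still has its usual catcode 8 and is then used inside a group where the catcode has been changed. The macro name \stored is made up for this example. The stored _ keeps behaving as a subscript, because the catcode change affects only characters read from the file afterwards.

\documentclass{article}
\begin{document}
% Tokenized here: the _ is stored as a (_,8) subscript token.
\def\stored{a_b}
{\catcode`\_=12 % newly read _ characters would now be "other"
 $\stored$}% still typesets a subscript: the stored token is unchanged
\end{document}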

\scantokens acts as if the tokens were written to a file, producing a stream of characters that is then read back in and therefore retokenised using the current catcode table.
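A minimal sketch of that round trip, reusing the active _ definition from the question: the hypothetical macro \stored holds tokens that were tokenized with _ at catcode 8, and \scantokens rereads them while _ is active. Setting \endlinechar=-1 is a choice made for this sketch (the question's \pp uses \ignorespaces instead) to avoid the space that the end of the pseudo-file would otherwise produce.

\documentclass{article}
{\catcode`\_=13 \gdef_{\rule[-1pt]{0.75em}{1.0pt}}}
\begin{document}
% Tokenized here: the _ is stored as a (_,8) token.
\def\stored{hello_world}
{\catcode`\_=13 \endlinechar=-1 % catcodes used for the rereading
 % Expand \stored first, so that its characters (not the name \stored)
 % are what \scantokens writes out and reads back.
 \expandafter\scantokens\expandafter{\stored}}
\end{document}

Without the two \expandafter tokens, \scantokens would reread the control sequence \stored itself, which would then expand to the old (_,8) token and raise the same "Missing $ inserted" error.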

Without \scantokens the classic construct you are looking for is:

\documentclass{article}
{\catcode`\_=13 \gdef_{\rule[-1pt]{0.75em}{1.0pt}}}
\setlength\parindent{0pt}
\begin{document}
{\catcode`\_=13 hello_world}

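% \aepp opens a group and changes the catcode BEFORE any argument is read
\def\aepp{\bgroup\catcode`\_=13 \xaepp}
% \xaepp then reads the argument (tokenized with _ active) and closes the group
\def\xaepp#1{#1\egroup}
\aepp{hello_world}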


\end{document}

which changes the catcode of _ before the argument is read. But this, like \verb (and not coincidentally), does not work when nested in the argument of another command, because by then the argument has already been tokenised.
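To make the nesting limitation concrete, here is a sketch that puts both variants inside the argument of \mbox, which is chosen here only as an example of "another command". The \scantokens version from the question still works because it retokenizes; the two-step version is left commented out because it should fail with "Missing $ inserted".

\documentclass{article}
{\catcode`\_=13 \gdef_{\rule[-1pt]{0.75em}{1.0pt}}}
% \scantokens version from the question
\def\pp#1{{\catcode`\_=13 \scantokens{#1\ignorespaces}}}
% two-step version: catcode change first, then read the argument
\def\aepp{\bgroup\catcode`\_=13 \xaepp}
\def\xaepp#1{#1\egroup}
\begin{document}
% Works: the argument of \mbox holds a (_,8) token, but \scantokens
% rereads the characters while _ is active.
\mbox{\pp{hello_world}}
% Fails if uncommented: \mbox has already tokenized its argument,
% so the catcode change in \aepp comes too late.
% \mbox{\aepp{hello_world}}
\end{document}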
