[Tex/LaTex] Could someone further elucidate expansion, catcodes, and scantokens…

e-tex expansion

In response to my question "With TikZ is it possible to pass the node content through a preprocessor?", @MarkWibrow suggested a solution using \scantokens.

{\catcode`\_=13 \gdef_{\rule[-1pt]{0.75em}{1.0pt}}}
\def\pp#1{{\catcode`\_=13 \scantokens{#1\ignorespaces}}}

I've never understood why \scantokens would be useful or not, and I've seen plenty of warnings about the reputed dangers of \scantokens. The documentation in e-TeX is rather sparse (at least, too sparse to enlighten me at all). So, I decided to experiment a bit on my own and wrote my own version of \pp:

\def\aepp#1{{\catcode`\_=13 #1}}

But

 \aepp{hello_world}

results in an error

! Missing $ inserted.
<inserted text> 
                $

I thought, initially, that I perhaps didn't understand something about how \catcode works and tried the following MWE:

\documentclass{article}
{\catcode`\_=13 \gdef_{\rule[-1pt]{0.75em}{1.0pt}}}
\setlength\parindent{0pt}
\begin{document}
{\catcode`\_=13 hello_world}
\end{document}

which compiles without error.

I find this a bit confusing, because isn't

{\catcode`\_=13 hello_world}

what \aepp{hello_world} expands to?

So, while I'm quite a bit in the dark about \scantokens, I'm more curious about why expansion of \aepp{hello_world} failed to accomplish what \pp{hello_world} does.

Best Answer

TeX's scanner (its "eyes") converts characters in a file to tokens. That happens only once: macro replacement text and all expansion processing work on tokens, which are (character code, catcode) pairs.

The catcode table affects only the conversion of characters to character tokens.

So your definition locally makes the catcode of _ 13, and if a _ character were encountered in that scope it would be tokenized as a (_,13) token. But you pass \aepp a token list as #1 whose tokens have already been tokenized, so the catcode table is never consulted and your change to the table is not used.
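
A minimal sketch of this point, assuming a LaTeX document where _ has its usual catcode 8 (\showtoken is a hypothetical helper name, not something from the question): \meaning reports the category that was attached when the token was created, whatever the catcode table says at the moment of use.

\def\showtoken#1{{\catcode`\_=13 \message{[\meaning#1]}}}
\showtoken_ % the log shows [subscript character _], not a macro

The argument was tokenized as (_,8) before \showtoken changed the table, so the local \catcode assignment has nothing left to act on.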

\scantokens acts as if the tokens were written to a file, producing a stream of characters that is then read back and so retokenized using the current catcode table.
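
The retokenizing can be observed with \meaning. This is only a sketch: \showrescanned is a hypothetical name, and it assumes the active _ has been given the global definition from the question.

{\catcode`\_=13 \gdef_{\rule[-1pt]{0.75em}{1.0pt}}}
\def\showrescanned#1{{\catcode`\_=13 \scantokens{\message{[\meaning#1]}}}}
\showrescanned_ % should report a macro rather than a subscript character

Here the characters are read again while _ has catcode 13, so the rescanned _ becomes an active character token and \meaning sees its definition.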

Without \scantokens the classic construct you are looking for is:

\documentclass{article}
{\catcode`\_=13 \gdef_{\rule[-1pt]{0.75em}{1.0pt}}}
\setlength\parindent{0pt}
\begin{document}
{\catcode`\_=13 hello_world}

\def\aepp{\bgroup\catcode`\_=13 \xaepp}
\def\xaepp#1{#1\egroup}
\aepp{hello_world}


\end{document}

which changes the catcode of _ before the argument is read. But this (like \verb, not coincidentally) does not work if nested in the argument of another command, because then the argument has again already been tokenised.
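
To illustrate the nesting limitation, here is a sketch that uses \mbox as the outer command (any macro that grabs its argument should behave the same way). It assumes the \bgroup-based \aepp and \xaepp from this answer and the \pp from the question are in force.

\aepp{hello_world}          % fine: _ is read after the catcode change
% \mbox{\aepp{hello_world}} % fails: \mbox has already tokenized _ with catcode 8
\mbox{\pp{hello_world}}     % should work: \scantokens retokenizes the argument

The commented-out line is expected to give the same "Missing $ inserted" error as the original \aepp, for the same reason.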

Related Solutions

[Tex/LaTex] Expandable full expansion of tokens that preserves catcodes

Did you try using \romannumeral? This is used a lot for this type of thing (see for example the \exp_args:Nf concept in expl3):

\def\fullyexpand#1{\romannumeral - `0#1}

This works because TeX will keep expanding #1 looking for a number, which will always turn out to be negative, so the Roman numeral will vanish. Note that this solution will stop on the first non-expandable token, unlike an \edef which will keep going.
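
A small sketch of where the expansion stops, using two hypothetical macros \inner and \outer that exist only for this illustration:

\def\inner{x}
\def\outer{\inner\relax\inner}
\expandafter\def\expandafter\result\expandafter{\romannumeral -`0\outer}
% \result is now x\relax\inner

TeX expands \outer and then the first \inner while looking for the end of the number, but x is not expandable, so the scan ends there and the second \inner is left untouched.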

It's possible to build a function which can expand using \romannumeral 'around' unexpandable tokens. For example, the following code will work reasonably well:

\long\def\fullyexpand#1{%
  \csname donothing\fullyexpandauxi{#1}{}%
}
\long\def\fullyexpandauxi#1{%
  \expandafter\fullyexpandauxii\romannumeral -`0#1\fullyexpandend
}
\long\def\fullyexpandauxii#1#2\fullyexpandend#3{%
  \ifx\donothing#2\donothing
    \expandafter\fullyexpandend
  \else
    \expandafter\fullyexpandloop
  \fi
  {#1}{#2}{#3}%
}
\long\def\fullyexpandend#1#2#3{\endcsname#3#1}
\long\def\fullyexpandloop#1#2#3{%
  \fullyexpandauxi{#2}{#3#1}%
}
\def\donothing{}

However, this is not the same as \expanded, for a few reasons. First, my implementation will strip out spaces in the argument (as it does a loop, and TeX will skip spaces). Braces will also get stripped out. A bit of testing also reveals that \romannumeral will expand \protected functions here, whereas \expanded does not. I'd also note that the above code needs some guards adding for a blank (empty or all space) argument, as currently things fail in these cases.

With a current release of LuaTeX one can use \expanded, which does more or less the same as an \edef but is expandable (it also does not require doubled # tokens). This primitive will be in TeX Live 2019 pdfTeX/e-pTeX/e-upTeX, and hopefully in XeTeX (yet to be confirmed). As a precursor to this, expl3 has a macro-based emulation, slow but working, which does token-by-token examination and allows 'e-type' expansion.
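
For comparison, a sketch of \expanded on a small input, assuming an engine that provides the primitive (\inner and \outer are hypothetical macros used only here):

\def\inner{x}
\def\outer{\inner\relax\inner}
\expandafter\def\expandafter\result\expandafter{\expanded{\outer}}
% \result is now x\relax x

A single \expandafter step is enough to trigger the whole expansion, and unlike the \romannumeral trick it carries on past the unexpandable x and \relax.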

As an aside, it is possible to use \scantokens expandably, but as you may have found this can be tricky, and it is usually necessary to have a (non-expandable) change of \everyeof first. LuaTeX addresses this issue with the \scantextokens primitive, which builds the end-of-file handling directly into the primitive. Of course, if you are using LuaTeX then the original problem is solvable anyway, since \expanded is available.

[Tex/LaTex] Use of \everyeof and \endlinechar with \scantokens

The \scantokens primitive is described in the e-TeX manual as working in a similar manner to the following code:

\toks0={...}% '...' is the rescanned material
\immediate\openout0=file
\immediate\write0{\the\toks0}
\immediate\closeout0
\input file

but without the use of files and in an expandable manner. However, it does use some of the same internals as the above. This has a consequence for using the primitive.

A pseudo-file is 'read' by TeX, and this is treated as having an end-of-file (EOF) marker. \scantokens tries to read this as a token, but that will raise an error, for example

! File ended while scanning definition of \demo

with code

\edef\demo{\scantokens{foo}}

To prevent this, you need to set \everyeof to insert a \noexpand before this marker:

\everyeof{\noexpand}
\edef\demo{\scantokens{foo}}

TeX then does not try to read past the end of the file and this error is avoided.

The second issue is that TeX tokenizes the 'end of line' characters in the normal way inside \scantokens. The common use is to have a single line scanned, as above, but the result will not be as might be expected:

\everyeof{\noexpand}
\edef\demo{\scantokens{foo}}
\show\demo

yields

> \demo=macro:
->foo .

with an additional space: the final 'end of line' (end of the pseudo-file) is converted to a space. To prevent this, you normally alter the end-of-line behaviour with

\endlinechar=-1

so that the end-of-line is ignored and no space is added.

It's then standard to wrap everything up in a group, for example when saving the result in a macro

\long\def\safescantokens#1#2{%
  \begingroup
    \everyeof{\noexpand}%
    \endlinechar=-1
    \xdef#1{\scantokens{#2}}%
  \endgroup
}
\safescantokens\demo{foo}

The group is used here so that the two additional steps don't affect any other code, while \xdef is the simplest way to get the result outside of the group. (An appropriate \expandafter chain is also a possible approach for that.)
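
A sketch of the \expandafter variant, which makes a local rather than a global assignment. The names \localscantokens and \tmp are hypothetical, and the sketch assumes the rescanned material contains no # tokens.

\long\def\localscantokens#1#2{%
  \begingroup
    \everyeof{\noexpand}%
    \endlinechar=-1
    \edef\tmp{\scantokens{#2}}%
  % expand \tmp once while the group is still open, then close the group
  \expandafter\endgroup
  \expandafter\def\expandafter#1\expandafter{\tmp}%
}
\localscantokens\demo{foo}

The chain of \expandafter tokens replaces \tmp by its contents before \endgroup is executed, so the \def that follows sees the rescanned tokens even though \tmp itself is gone by then.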

All of this makes the resulting use non-expandable, which somewhat defeats the point of the primitive (although files are still not used). As a result, in LuaTeX there is a \scantextokens primitive which specifically addresses these issues: the end-of-file is ignored and no end line character is inserted after the last line (which is almost always the only line).
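
A sketch of the LuaTeX variant, assuming the document is compiled with LuaTeX and that the primitive is available under its unprefixed name (as it normally is in LuaLaTeX):

\edef\demo{\scantextokens{foo}}
\show\demo

Going by the description above, this should need neither \everyeof nor \endlinechar settings, and \demo should contain foo with no trailing space.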
