[Tex/LaTex] A guide to understanding expandability: when to write protected functions and when not to

expansion latex3

I'm having difficulty understanding (and appreciating) the concept of expandability. I'm very murky about understanding when and how expandability impacts me in writing code for my documents.

I've read Why isn't everything expandable?. The answer was interesting and useful, but it didn't get at the heart of what I'm curious about. I've also perused a number of the answers to other questions involving expandability: of particular interest was this post.

In response to a recent question of mine, it was explained that document commands are protected and hence not expandable. Understanding this allowed me to write what I wanted and to get the effect I expected.

And in a comment to another question of mine, it was explained how one should use \cs_new_protected:Npn "when the function does unexpandable jobs such as setting token lists or sequences."

For years, I've been writing code like

\newcommand{\currentanswer}{}
\newcommand{\setcurrentanswer}[1]{\renewcommand{\currentanswer}{#1}}

knowing that after calling \setcurrentanswer, any call to \currentanswer will result in the desired output. Am I relying upon (un)expandability here? I'm not really sure; I only know that it does what I want. Then there are times I know I can throw in a \protect to get the result I want: but, I really don't understand the why of it; I just know it gets the job done.

Recently, I've been trying to learn some LaTeX3: the more I play with it, the more I like it. LaTeX, which I always thought was pretty powerful, is suddenly much more powerful and transparent in the manner that macros and functions can be defined. But now I also seem to be running up against this issue of expandability, whereas before I could blithely go about my business ignorant of some of the subtleties of what I was doing.

While I am asking multiple questions here, I suspect that they really have much the same answer: hence I'm not splitting them across multiple posts.

Could someone take the time to explain some of the nuances of expandability, or, if not, point me to a good reference?
How do I know when I'm working with a protected function/macro?
Are protected and unexpandable the same thing?
Could someone explain the preference for protected functions in LaTeX3?
And finally, apart from the answers to the above questions, why would it be preferable to protect functions that perform unexpandable tasks, such as setting token lists and sequences? (I am very interested in understanding this last question.)

Best Answer

An expandable command is one which can be converted 'fully' into its output inside a TeX \edef or \write (and a few other places). Thus for example

\def\testa{\testb}
\def\testb{\testc}
\def\testc{d}
\edef\teste{\testa}
\show\teste

will give

> \teste=macro:
->d.

i.e. all of the steps have been expanded, and we have just characters.

For text, this is nice and simple, but when you get TeX primitives involved things are more complex as some are expandable and some are not. Broadly, anything which performs an assignment is unexpandable. So if we have

\def\testa{\testb}
\def\testb{\testc}
\def\testc{\def\ARG{d}}
\def\ARG{}
\edef\teste{\testa}
\show\teste

we get

> \teste=macro:
->\def {d}.

Notice how the \def is left unchanged but the \ARG has vanished: it got expanded to what it is defined as (empty).
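For comparison, the classic way to keep \ARG in place without e-TeX is to put \noexpand in front of it at the point of use, which is roughly the idea behind LaTeX2e's \protect. A minimal sketch reusing the macros above (nothing here is new apart from the \noexpand):

```latex
\def\testa{\testb}
\def\testb{\testc}
% \noexpand stops the \edef from expanding the token that follows it
\def\testc{\def\noexpand\ARG{d}}
\def\ARG{}
\edef\teste{\testa}
\show\teste
```

This time \show should report \def \ARG {d}, but the burden is on whoever writes \testc to remember the \noexpand every time.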

e-TeX allows us to define a protected macro. These do not expand inside an \edef, so

\def\testa{\testb}
\def\testb{\testc}
\def\testc{\def\NOTARG{d}}
\protected\def\NOTARG{}
\edef\teste{\testa}
\show\teste

now yields

> \teste=macro:
->\def \NOTARG {d}.

There is a subtle but important point here: \def is an unexpandable primitive, while \NOTARG is now a protected macro. You can tell that \NOTARG is protected using \show:

> \NOTARG=\protected macro:
->.

where the \protected tells us what we need to know. However, you have to know that \def is not expandable.
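If you would rather test this in code than read \show output, expl3 offers \token_if_protected_macro:NTF. The following is only a sketch: the \demo_... functions are made-up names used purely for illustration.

```latex
\documentclass{article}
\usepackage{expl3} % harmless on recent kernels, needed on older ones
\begin{document}
\ExplSyntaxOn
% Hypothetical functions, only here to have something to test.
\cs_new:Npn \demo_brackets:n #1 { [#1] }
\cs_new_protected:Npn \demo_store:n #1 { \tl_set:Nn \l_tmpa_tl {#1} }
% Expandable reporter: leaves one of two words in the input stream.
\cs_new:Npn \demo_report:N #1
  { \token_if_protected_macro:NTF #1 { protected } { not~protected } }
\demo_report:N \demo_brackets:n \par
\demo_report:N \demo_store:n
\ExplSyntaxOff
\end{document}
```

The first line typeset should read "not protected" and the second "protected". Note that this only detects protected macros; it says nothing about unexpandable primitives such as \def.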

In the LaTeX3 documents, rather than expect people to learn the rules we've gone with a different approach: we document which functions are expandable (they are marked with a star). The reason everything else is then protected is that 'partial' expansion is a real issue. If you do

\def\testa{\let\testb\testc}
\edef\testb{\testa}

you get

! Undefined control sequence.
\testa ->\let \testb 
                     \testc

as \let is unaffected by the \edef but \testb is undefined. This gets worse when you look at 'real' documents, as the problem can be hidden many layers down.
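As a sketch of how protection avoids this, here is the same example with \testa defined as a protected macro (e-TeX required, as above):

```latex
% \testa is now protected, so the \edef leaves it alone
\protected\def\testa{\let\testb\testc}
\edef\testb{\testa}
\show\testb
```

No error is raised during the \edef, and \show should report that \testb is a macro whose replacement text is just \testa. The \let is never exposed on its own, so there is no half-expanded state.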

Many of the issues people see in real LaTeX2e documents, for example trouble caused by a forgotten \protect, would be bypassed if most commands were protected. In general, you find a lot more (La)TeX code that is not expandable than code that is, so the position for LaTeX3 is that expandable code is the exception, certainly for document commands. (Typesetting is not expandable, and that's what happens in documents.)

This leads us on to what I call the 'sheep and goats' approach to protected functions: all LaTeX3 code is either protected or fully expandable ['safe' (will give the expected result) inside \edef/x-type expansion], even if we are talking about auxiliary functions. The result is that we can always be sure if a function can be used in an expansion context: if it can, it's marked with a star, otherwise it will be protected and won't expand part way. So the 'correct' way to write LaTeX3 code is that if you use anything that is not expandable (i.e. not starred in the documentation) in your code, then you have to use \cs_new_protected:Npn or similar, and not \cs_new:Npn, etc.
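To make the rule concrete, here is one way the \setcurrentanswer example from the question might look in expl3. This is a sketch, not the only possible design: the demo names are invented for illustration, and \NewDocumentCommand and \NewExpandableDocumentCommand are assumed to be available from the kernel (on older installations, load xparse).

```latex
\documentclass{article}
\ExplSyntaxOn
\tl_new:N \l_demo_answer_tl
% Performs an assignment, so it must be protected.
\cs_new_protected:Npn \demo_set_answer:n #1
  { \tl_set:Nn \l_demo_answer_tl {#1} }
% Only uses a starred (expandable) function, so \cs_new:Npn is fine.
\cs_new:Npn \demo_answer:
  { \tl_use:N \l_demo_answer_tl }
\NewDocumentCommand \setcurrentanswer { m } { \demo_set_answer:n {#1} }
\NewExpandableDocumentCommand \currentanswer { } { \demo_answer: }
\ExplSyntaxOff
\begin{document}
\setcurrentanswer{42}
\edef\saved{\currentanswer}% expands fully to the stored text
\setcurrentanswer{43}
Saved: \saved; current: \currentanswer.
\end{document}
```

The setter is a "goat" (protected, does an assignment) and the getter is a "sheep" (expandable), so the \edef captures 42 while the later call still sees 43.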

A fully expandable sanitizer

The following is an implementation of a \Sanitize command that:

Completely removes all control sequences and balanced braces in its argument.
Does not choke on nested braces.
Keeps spaces where they were requested either by " " or "\ " (for after macros).
Is fully expandable (i.e. can be put in \edef or \csname).

Edit: This is a revised version. My initial code had a few minor bugs that were a major pain to fix, and this is substantially rewritten. I think it's clearer, too.

How it works

There are three states: sanitizing spaces, sanitizing groups, and sanitizing tokens. We scan for "words" one at a time, then within each "word" look for groups that might be hiding spaces (TeX's macro scanner will only absorb delimited arguments with balanced braces). Finally, once we are satisfied that we are looking at genuinely contiguous tokens, we scan one at a time and throw out the ones that are control sequences, leaving only explicitly specified spaces (" " or "\ ").

From the inside out, the operation looks like this:

\SanitizeTokens is a big nested conditional that tests its argument against the various special cases. During the sweep for spaces, all space characters were converted to \SanitizedSpace tokens, and they are now converted to \RealSpaces. Both \SanitizedSpace and \SanitizeStop are macros that expand to themselves, and since they are private, this means that testing against them via \ifx is a reliable way to detect the exact control sequences (in the first version, these were \countdef tokens, which have the same property but are not quite as private).
\SanitizeGroups uses the tricky \def\SanitizeGroups#1#{ construction discussed in this question: Macros with # as the last parameter. It is the most legitimate such use I can imagine: its point is to detect groups, which you can't do using plain macro expansion in any other way. It guarantees that #1 has no groups in it, and since this comes after space elimination, it also has no spaces in it, so we can run \SanitizeTokens straight away. We then "enter" the group and go back to eliminating spaces.
\SanitizeSpaces uses pattern matching to grab the first chunk of text up until a space, excluding of course those spaces that are in groups. There is a technical trick here: every use of this macro has {} right after it, before the text. The point of that is so that the argument scanner doesn't remove braces around a group constituting an entire "word" between spaces. If that happens, then we will erroneously treat it as though it's been cleared of spaces when, in fact, it has not. (Any unsanitized spaces would be eaten by \SanitizeTokens because argument scanning ignores spaces.)
There are of course some cute utility macros. My favorite is \IfNoGapToStop, which is called like this: \IfNoGapToStop.X. \SanitizeStop, with X being the quantity potentially containing a gap. If it has none, then the first gap is the visible space after the period; if it has a gap, then the two periods are in different components, and both arguments of \IfNoGapToStop are nonempty.
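The #1#{ construction is easier to see in isolation. In this small sketch, \UpToBrace is a hypothetical name rather than part of the sanitizer: its argument is everything up to the first opening brace, and the brace group itself is left in the input stream.

```latex
\documentclass{article}
\begin{document}
% \UpToBrace is a hypothetical name; #1 is everything before the first {
\def\UpToBrace#1#{[#1]}
\edef\result{\UpToBrace abc{def}}
\texttt{\meaning\result}
\end{document}
```

The typeset meaning should show the replacement text [abc]{def}: the ungrouped tokens were wrapped in brackets, while the group was passed over untouched, which is what \SanitizeGroups relies on to find group boundaries.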

Aside from the structural changes from the previous version, this one correctly preserves spaces at the boundaries of groups. (That version didn't explicitly scan for groups, but eliminated them as a side effect of absorbing tokens. That works, but it also makes it impossible to be sure when you are looking at a group, which may have spaces, rather than a single token.)

Oh, and of course: the algorithm is no longer stupid. The last version rescanned the entire initial portion of the text repeatedly while looking for words (the point of that was so as not to "lose" those tokens before sanitizing them). Now I crawl through the words one at a time, so there's no problem with abandoning each one when looking for the next. That turns a quadratic algorithm into a linear one.

This is not my preferred way of writing TeX anymore (for that, you should read this answer: How to write readable commands) but pgfkeys is really not the tool for this kind of textual parsing.

\documentclass{article}

\makeatletter
\newcommand\Sanitize[1]{%
 \SanitizeSpaces{}#1 \SanitizeStop
}

% This loops through and replaces all spaces (outside brace groups) with \SanitizedSpace's.
% Then it goes for the control sequences.
% All calls to this should put a {} right before the content, to inhibit the gobbling of braces
% if there is a group right at the beginning.
\def\SanitizeSpaces#1 #2\SanitizeStop{%
 \IfEmpty{#2}% Last word
  {\IfEmpty{#1}% No content at all
   {}% Nothing to do
   {\SanitizeGroups#1{\SanitizeStop}}%
  }%
  % No need for a trailing space anymore: there's already one from the initial call
  {\SanitizeGroups#1\SanitizedSpace{\SanitizeStop}\SanitizeSpaces{}#2\SanitizeStop}%
}

% Sanitize tokens up to the next group, then go back to doing spaces.
\def\SanitizeGroups#1#{%
 \SanitizeTokens#1\SanitizeStop
 \EnterGroup
}

% Sanitize the next group from the top.
\newcommand\EnterGroup[1]{%
 \ifx\SanitizeStop#1%
  \expandafter\@gobble
 \else
  \expandafter\@firstofone
 \fi
 {\SanitizeSpaces{}#1 \SanitizeStop\SanitizeGroups}%
}

\newcommand\SanitizeTokens[1]{%
 \ifx\SanitizeStop#1%
 \else
  \ifx\SanitizedSpace#1%
   \RealSpace
  \else
   \ifx\ #1%
    \RealSpace
   \else
    \if\relax\noexpand#1%
    \else
     #1%
    \fi
   \fi
  \fi
  \expandafter\SanitizeTokens
 \fi
}

% We use TeX's proclivity to eat braces even for delimited arguments to eat the braces if #1 
% happens to be just {}, which we put in.
% Even if we didn't put it in, {} is going to get thrown out when \SanitizeSpaces gets to it.
\newcommand\IfEmpty[1]{%
 \IfOneTokenToStop.#1\SanitizeStop
  {% #1 has at most space tokens
   % and thus is nonempty if and only if there is a gap:
   \IfNoGapToStop.#1. \SanitizeStop
  }
  {% #1 has non-space tokens
   \@secondoftwo
  }%
}

% Checks for a gap in #1, meaning #2 is nonempty
% This should only be used with \IfEmpty
\def\IfNoGapToStop#1 #2\SanitizeStop{%
 % It's enough to check for one token, since #2 is never just spaces
 \IfOneTokenToStop.#2\SanitizeStop
}

\def\IfOneTokenToStop#1#2{% From \IfEmpty, #1 is always a .
 \ifx\SanitizeStop#2%
  % If #2 is multi-token, the rest of it will fall in the one-token case and be passed over.
  % If not, well, that's what we asked for.
  \expandafter\@firstoftwo
 \else
  \expandafter\GobbleToStopAndSecond
 \fi
}

\def\GobbleToStopAndSecond#1\SanitizeStop{%
 \@secondoftwo
}
\makeatother

\def\SanitizeStop{\SanitizeStop}
\def\SanitizedSpace{\SanitizedSpace}
\def\RealSpace{ }

\begin{document}
\setlength\parindent{0pt}\ttfamily

% Torture test
\edef\a{%
 \Sanitize{ Word1 \macro{Word2 Word3}{\macro\ Word4}{ Word5} {Word6 }{}Word7{ }{{Word8}} }
}\meaning\a

\a
\medskip

% Examples
\edef\a{%
 \Sanitize{\emph{This} sentence has \TeX\ macros and {grouping}. }
}\meaning\a

\a
\medskip

\edef\a{%
 \Sanitize{{A}{ gratuitously {nested} sentence {}{{with many} layers}}.}
}\meaning\a

\a
\medskip

\end{document}

Is a latex3 function without a (hollow) star expandable

All functions without a star are (or at least should be) \protected, so they are not expandable in the expl3 sense: they do not expand under x- or e-type expansion or when written to a file.

f- and c-type expansion are different and ignore \protected. But it is wrong to (intentionally) expand functions without stars using f-type expansion.

f-type expansion expands tokens until it reaches the first unexpandable token or a space. Since it ignores \protected, an unexpandable token here means either an unexpandable primitive or a character token of category 1, 2, 3, 4, 6, 7, 8, 11, or 12. The unexpandable token is left in place; a space is gobbled.
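A small sketch of the difference between f-type and x-type expansion, using the scratch variables \l_tmpa_tl and \l_tmpb_tl (the particular input is just an illustration):

```latex
\documentclass{article}
\usepackage{expl3} % harmless on recent kernels, needed on older ones
\begin{document}
\ExplSyntaxOn
% f-type: expansion stops as soon as the first result digit appears.
\tl_set:Nf \l_tmpa_tl { \int_eval:n { 1 + 2 } ~ \int_eval:n { 3 + 4 } }
% x-type: everything expandable is expanded.
\tl_set:Nx \l_tmpb_tl { \int_eval:n { 1 + 2 } ~ \int_eval:n { 3 + 4 } }
\texttt { \tl_to_str:N \l_tmpa_tl } \par
\texttt { \tl_to_str:N \l_tmpb_tl }
\ExplSyntaxOff
\end{document}
```

With f-type expansion the first \int_eval:n becomes 3, which is an unexpandable character token, so the second \int_eval:n should be stored unexpanded. With x-type expansion both are evaluated and the result should be 3 7.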

By the way, many of the hollow-starred functions aren't necessarily f-type safe, but can be, depending on the user code being supplied. For instance, \int_step_function:nN is not f-expansion safe in general, but a usage such as \int_step_function:nN {10} \use_none:n would be (though admittedly this is a nonsense example). The reason they are marked with a hollow star is usually that they iterate over some input and leave some code in the input stream for each element. That code may or may not be expandable, so there is no guarantee that f-type expansion would always work out.
