I am trying to create an expandable command that accepts a single argument that may contain control sequences, and expands to that same argument with all control sequences and braces removed. That is:
\StripControlSequences{John Q. Author, \textit{Book Title}}
should expand to merely:
John Q. Author, Book Title
Alternatively, if I could designate the control sequences that get stripped out, such as \textit, \textbf, etc., that would be reasonable as well.
If I didn't care about expandability, this would be easy. I have a macro using xstring that strips out arbitrary control sequences. If I just wanted to get rid of the formatting from \textit during execution, I could do something like this:
\def\StripControlSequences#1{{%
\let\textit=\relax%
#1%
}}
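For concreteness, a sketch (with made-up input) of this execution-time approach in use, and of where it breaks down:

```tex
% Execution-time version: works when typeset directly.
\def\StripControlSequences#1{{%
  \let\textit=\relax
  #1%
}}
\StripControlSequences{John Q. Author, \textit{Book Title}}
% ... typesets: John Q. Author, Book Title
% But \let is an assignment and cannot be carried out during
% expansion, so inside \csname...\endcsname the leftover
% unexpandable tokens (starting with the opening brace) trigger
% "! Missing \endcsname inserted.":
%\csname\StripControlSequences{John \textit{Title}}\endcsname
```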
Unfortunately, this is not expandable; from my understanding, any macro that itself uses assignments (including \let and \def) can never be expanded. In this case, I need to use the output of \StripControlSequences in something like the following:
\def\Def#1#2{\expandafter\def\csname\StripControlSequences{#1}\endcsname{#2}}
such that the command sequence could later be called, without the control sequences, like the following:
\Def{John Q. Author, \textit{Book Title}}{full citation string}
\begin{document}
\csname John Q. Author, Book Title\endcsname
\end{document}
which would produce a document with only:
full citation string
(of course, users of my macro package wouldn't call \csname ... \endcsname directly, but you get the point). In case anyone is wondering about the context: I have a macro package for producing automated legal citations, and given that my target audience is non-technical, I need to make the interface as simple as possible. (I'm happy to explain more about the broader context, if necessary.)
I've hunted for a while for any good answers to this without luck, but I apologize if this has been asked before! I basically understand the problems with fragile/robust commands, but it seems to me that no combination of \protect, etc. will be applicable here, because \protect merely delays expansion until later execution, so as to allow its argument to be moved. On the other hand, here I basically want execution at all times.
So perhaps another way to ask this question is: is it possible to force execution, like an \edef that fully executes its argument instead of merely expanding it?
Best Answer
I am pleased to be able to teach Martin Scharrer something he didn't know :)
A fully expandable sanitizer
The following is an implementation of a \Sanitize command that:
- Completely removes all control sequences and balanced braces in its argument.
- Does not choke on nested braces.
- Keeps spaces where they were requested, either by " " or "\ " (for use after macros).
- Is fully expandable (i.e. it can be put in an \edef or a \csname).
Edit: This is a revised version. My initial code had a few minor bugs that were a major pain to fix, and this is substantially rewritten. I think it's clearer, too.
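Since the command is fully expandable, the use cases from the question work directly; a brief usage sketch (assuming the \Sanitize implementation from this answer is loaded):

```tex
% Expandable, so it survives \edef ...
\edef\plain{\Sanitize{John Q. Author, \textit{Book Title}}}
% \plain now holds the text: John Q. Author, Book Title

% ... and \csname, as in the question's \Def macro:
\def\Def#1#2{\expandafter\def\csname\Sanitize{#1}\endcsname{#2}}
\Def{John Q. Author, \textit{Book Title}}{full citation string}
% \csname John Q. Author, Book Title\endcsname now yields: full citation string
```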
How it works
There are three states: sanitizing spaces, sanitizing groups, and sanitizing tokens. We scan for "words" one at a time, then within each "word" look for groups that might be hiding spaces (TeX's macro scanner will only absorb delimited arguments with balanced braces). Finally, once we are satisfied that we are looking at genuinely contiguous tokens, we scan one at a time and throw out the ones that are control sequences, leaving only explicitly specified spaces (" " or "\ ").
From the inside out, the operation looks like this:
- \SanitizeTokens is a big nested conditional that tests its argument against the various special cases. During the sweep for spaces, all space characters were converted to \SanitizedSpace tokens, and they are now converted to \RealSpace tokens. Both \SanitizedSpace and \SanitizeStop are macros that expand to themselves, and since they are private, testing against them via \ifx is a reliable way to detect those exact control sequences (in the first version, these were \countdef tokens, which have the same property but are not quite as private).
- \SanitizeGroups uses the tricky \def\SanitizeGroups#1#{ construction discussed in this question: Macros with # as the last parameter. It is the most legitimate such use I can imagine: its point is to detect groups, which you can't do using plain macro expansion in any other way. It guarantees that #1 has no groups in it, and since this comes after space elimination, it also has no spaces in it, so we can run \SanitizeTokens straight away. We then "enter" the group and go back to eliminating spaces.
- \SanitizeSpaces uses pattern matching to grab the first chunk of text up until a space, excluding of course those spaces that are in groups. There is a technical trick here: every use of this macro has {} right after it, before the text. The point of that is so that the argument scanner doesn't remove braces around a group constituting an entire "word" between spaces. If that happened, we would erroneously treat the group as though it had been cleared of spaces when, in fact, it has not. (Any unsanitized spaces would be eaten by \SanitizeTokens, because argument scanning ignores spaces.)
There are of course some cute utility macros. My favorite is \IfNoGapToStop, which is called like this: \IfNoGapToStop.X. \SanitizeStop, with X being the quantity potentially containing a gap. If X has no gap, then the first gap is the visible space after the period; if it has a gap, then the two periods are in different components, and both arguments of \IfNoGapToStop are nonempty.
Aside from the structural changes from the previous version, this one correctly preserves spaces at the boundaries of groups. (That version didn't explicitly scan for groups, but eliminated them as a side effect of absorbing tokens. That works, but it also makes it impossible to be sure when you are looking at a group, which may have spaces, rather than at a single token.)
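The individual tricks above can be illustrated in isolation; a hedged sketch (the macro bodies here are simplified placeholders, not the actual implementation):

```tex
% 1. Private self-expanding markers: a macro \def'd to expand to
%    itself is \ifx-equal only to a macro defined the same way,
%    so it can be detected reliably:
\def\SanitizedSpace{\SanitizedSpace}
\def\SanitizeStop{\SanitizeStop}
%    ... \ifx\SanitizeStop#1 (hit the end marker) \else ... \fi

% 2. The "# as last parameter" construction: #1 is delimited by
%    the next explicit {, which stays in the input to open the group:
\def\SanitizeGroups#1#{%
  % #1 is guaranteed brace-free (hence group-free); after the
  % earlier space sweep it is also space-free
}

% 3. Gap detection with delimited arguments:
\def\IfNoGapToStop#1 #2\SanitizeStop{%
  % test whether #2 is empty
}
% Called as \IfNoGapToStop.X. \SanitizeStop :
%   no space in X -> #1 = .X.  and #2 is empty
%   a space in X  -> the first space splits X, so both #1 and #2
%                    are nonempty
```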
Oh, and of course: the algorithm is no longer stupid. The last version rescanned the entire initial portion of the text repeatedly while looking for words (the point of that was so as not to "lose" those tokens before sanitizing them). Now I crawl through the words one at a time, so there's no problem with abandoning each one when looking for the next. That turns a quadratic algorithm into a linear one.
This is not my preferred way of writing TeX anymore (for that, you should read this answer: How to write readable commands), but pgfkeys is really not the tool for this kind of textual parsing.