[Tex/LaTex] Which commands consisting of a non-letter eat spaces afterwards

spacingtex-core

Command names consist of a sequence of letters (until and excluding the first non-letter) or of a single non-letter.

Command names ending in a letter (including of course those consisting of a single letter, such as \S and \P) eat all spaces after them. (Compile for example \P␣␣,H.)

What about command names consisting of a single non-letter: which of these eat subsequent spaces? H\$H and H\$␣H produce different output (and that's true for e.g. % and & too). While the space-producing command \, does not eat spaces, the space-producing command \␣ does seem to eat all subsequent spaces. (H\␣H and H\␣␣H and H\␣␣␣H all produce the same output. H \␣H is different; see next paragraph.)

Knowledge of TeX's behavior will explain why for example A\,B, A␣\,B/A\,␣B, A␣\,␣B produce different results (in text mode); if a user is not aware of what happens and naively (but understandably) assumes that such spacing commands eat all spaces around them, he or she might run into surprises. (Actually only few commands seem to eat preceding spaces, though such behavior is possible: have your macro start with \unskip.)

Guide to the answers:

most concise summary (≈ "only \␣"): Joseph Wright's answer [if it weren't for Heiko's answer, I would have accepted this one]
all the details (with an interesting detail about empty (!) command names): Heiko Oberdiek's answer
apparent exceptions (the 7 standard one-letter, accent-producing commands and \\): Mico's answer

Best Answer

From "The TeXbook":

$\ddanger$ If TeX sees an escape character (category 0) in any state, it scans the entire control sequence name as follows. (a) If there are no more characters in the line, the name is empty (like \csname\endcsname). Otherwise (b) if the next character is not of category 11 (letter), the name consists of that single symbol. Otherwise (c) the name consists of all letters beginning with the current one and ending just before the first nonletter, or at the end of the line. This name becomes a control sequence token. TeX goes into state S in case (c), or in case (b) with respect to a character of category 10 (space) [read: "in case (b) if the single symbol is of category 10 (space)"]; otherwise TeX goes into state M.

State S is beginning of line, there spaces are ignored; state M is middle of line.

If the name consists of letters entirely, the length does not matter, one or more letters. Then TeX ignores spaces as in the begin of a line. The same happens in case of the command \␣. The command itself sets a space, but following spaces are ignored.

Backslash at line end:

If TeX reads a line, it removes the end of line characters (carriage return and/or linefeed) and all space characters from the right end (i.e., any such characters occurring immediately before the end of line character). Then it inserts the character, configured by \endlinechar, unless it is suppressed (e.g. it has a negative value). From "The TeXbook":

TeX deletes any ⟨space⟩ characters (number 32) that occur at the right end of an input line. Then it inserts a ⟨return⟩ character (number 13) at the right end of the line, except that it places nothing additional at the end of a line that you inserted with I during error recovery. Note that ⟨return⟩ is considered to be an actual character that is part of the line; you can obtain special effects by changing this catcode.

...

The special character inserted at the end of each line needn't be ⟨return⟩; TeX actually inserts the current value of an integer parameter called \endlinechar, which normally equals 13 but it can be changed like any other parameter. If the value of \endlinechar is negative or greater than 255, no character is appended, and the effect is as if every line ends with % (i.e., with a comment character).

Note: LuaTeX restricts the values of \endlinechar. The upper limit is 127. Larger values cause the error ! Invalid \endlinechar.

In LaTeX the end of line character is ^^M (character code 13, 0x0D) and has category 5 (end of line). If TeX is in state M, this end-of-line character is converted to a space [this is the important part!], thus the backslash at the end of line usually becomes \␣.

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{lmodern}

\begin{document}

\expandafter\def\csname\endcsname{<empty>}
\def\ {<space>}
\def\@{<at>}

[\
]

\begingroup
  \endlinechar=-1
  [\
  ] 
\endgroup

\begingroup
  \endlinechar=`@ %
  [\
  ]%
\endgroup %

\end{document}

Related Solutions

[Tex/LaTex] When to use \edef, \noexpand, and \expandafter

Expansion is a complicated area of TeX programming. I'll try to explain the key primitives involved first, then try to come up with some examples.

The \expandafter primitive expands the token after the next one. So

\expandafter\def\csname an-awkward-name\endcsname

will expand \csname before \def. So after one expansion the above turns into

\def\an-awkward-name

which will then do its thing. Life becomes more complex when you want to step further ahead, and it soon becomes very hard to track what is going on.

The \edef> primitive does a full expansion of what is given as its argument (in contrast to \def, which simply stores the input). So

\def\examplea{more stuff}
\edef\exampleb{Some stuff \csname examplea\endcsname}

will expand the \csname name\endcsname to \examplea, then expand that to leave a final definition of \exampleb as 'Some stuff more stuff'.

Now, \noexpand comes in by preventing \edef from doing an expansion of the next token. So if I modify my above example to read

\def\examplea{more stuff}
\edef\exampleb{Some stuff \expandafter\noexpand\csname examplea\endcsname}

then what will happen is that the \edef will execute the \expandafter, which will turn the above effectively into

\def\examplea{more stuff}
\edef\exampleb{Some stuff \noexpand\examplea}

Now the \noexpand will operate (disappearing in the process), leaving the definition of \exampleb as 'Some stuff \examplea'.

We can use this ability to cut down on \expandafter use, but there are a couple of other things to know. First, e-TeX includes an additional primitive \unexpanded, which will prevent expansion of multiple tokens. Secondly, there are various special cases where you don't need quite so many \expandafter statements. A classic example is from within \csname, as this will do expansion anyway. So you'll see things like

\csname name\expandafter\endcsname\token

which will expand \token before \name.

Back to your example. In the first one, there isn't much to do: as the entire point is to have a dynamic name (#1), doing an \edef at point-of-definition doesn't really make sense. The closest one can get is something like

\edef\cohtheory{%
  \noexpand\newcommand\expandafter\noexpand\csname foofunc\endcsname[1][*]{%
  \noexpand\MakeUppercase{foo}^{##1}}%
}

What will happen here is that \newcommand and \MakeUppercase will be protected from expansion, and the \csname will only expand once. (Tokens which don't have an expansion don't need protection, which is why things like '[1]' are simply included as is.) Of course, this is something of a 'toy' as all it does is create a fixed \foofunc.

For your second example, you could instead to

\begingroup
  \edef\temp{%
    \endgroup
    \noexpand\command
    {\unexpanded\expandafter{\argone}}%
    {\unexpanded\expandafter{\argtwo}}%
  }
\temp

I'm using a couple of extra ideas here. First, the group is used so that \temp is not altered anywhere other than where I'm using it. The \endgroup primitive will do nothing inside the \edef, and so will still be there to close the group when \temp is used. Secondly, \unexpanded works like a toks, and so will respect the \expandafter after it but before the {. This cuts down on an unnecessary \expandafter.

There are more wrinkles to this, and often there are several equally-efficient and clear methods. You are best off posting specific examples, and seeking advice on how they might be achieved.

[Tex/LaTex] Can one define an expandable command that removes control sequences from its argument

I am pleased to be able to teach Martin Scharrer something he didn't know :)

A fully expandable sanitizer

The following is an implementation of a \Sanitize command that:

Completely removes all control sequences and balanced braces in its argument.
Does not choke on nested braces.
Keeps spaces where they were requested either by " " or "\ " (for after macros).
Is fully expandable (i.e. can be put in \edef or \csname).

Edit: This is a revised version. My initial code had a few minor bugs that were a major pain to fix, and this is substantially rewritten. I think it's clearer, too.

How it works

There are three states: sanitizing spaces, sanitizing groups, and sanitizing tokens. We scan for "words" one at a time, then within each "word" look for groups that might be hiding spaces (TeX's macro scanner will only absorb delimited arguments with balanced braces). Finally, once we are satisfied that we are looking at genuinely contiguous tokens, we scan one at a time and throw out the ones that are control sequences, leaving only explicitly specified spaces (" " or "\ ").

From the inside out, the operation looks like this:

\SanitizeTokens is a big nested conditional that tests its argument against the various special cases. During the sweep for spaces, all space characters were converted to \SanitizedSpace tokens, and they are now converted to \RealSpaces. Both \SanitizedSpace and \SanitizeStop are macros that expand to themselves, and since they are private, this means that testing against them via \ifx is a reliable way to detect the exact control sequences (in the first version, these were \countdef tokens, which have the same property but are not quite as private).
\SanitizeGroups uses the tricky \def\SanitizeGroups#1#{ construction discussed in this question: Macros with # as the last parameter. It is the most legitimate such use I can imagine: its point is to detect groups, which you can't do using plain macro expansion in any other way. It guarantees that #1 has no groups in it, and since this comes after space elimination, it also has no spaces in it, so we can run \SanitizeTokens straight away. We then "enter" the group and go back to eliminating spaces.
\SanitizeSpaces uses pattern matching to grab the first chunk of text up until a space, excluding of course those spaces that are in groups. There is a technical trick here: every use of this macro has {} right after it, before the text. The point of that is so that the argument scanner doesn't remove braces around a group constituting an entire "word" between spaces. If that happens, then we will erroneously treat it as though it's been cleared of spaces when, in fact, it has not. (Any unsanitized spaces would be eaten by \SanitizeTokens because argument scanning ignores spaces.)
There are of course some cute utility macros. My favorite is \IfNoGapToStop, which is called like this: \IfNoGapToStop.X. \SanitizeStop, with X being the quantity potentially containing a gap. If it has none, then the first gap is the visible space after the period; if it has a gap, then the two periods are in different components, and both arguments of \IfNoGapToStop are nonempty.

Aside from the structural changes from the previous version, this one correctly preserves spaces at the boundaries of groups. (That version didn't explicitly scan for groups, but eliminated them as a side effect of absorbing tokens. That works, but it also makes it impossible to be sure when you are looking at a group, which may have spaces, rather than a single token.)

Oh, and of course: the algorithm is no longer stupid. The last version rescanned the entire initial portion of the text repeatedly while looking for words (the point of that was so as not to "lose" those tokens before sanitizing them). Now I crawl through the words one at a time, so there's no problem with abandoning each one when looking for the next. That turns a quadratic algorithm into a linear one.

This is not my preferred way of writing TeX anymore (for that, you should read this answer: How to write readable commands) but pgfkeys is really not the tool for this kind of textual parsing.

\documentclass{article}

\makeatletter
\newcommand\Sanitize[1]{%
 \SanitizeSpaces{}#1 \SanitizeStop
}

% This loops through and replaces all spaces (outside brace groups) with \SanitizedSpace's.
% Then it goes for the control sequences.
% All calls to this should put a {} right before the content, to inhibit the gobbling of braces
% if there is a group right at the beginning.
\def\SanitizeSpaces#1 #2\SanitizeStop{%
 \IfEmpty{#2}% Last word
  {\IfEmpty{#1}% No content at all
   {}% Nothing to do
   {\SanitizeGroups#1{\SanitizeStop}}%
  }%
  % No need for a trailing space anymore: there's already one from the initial call
  {\SanitizeGroups#1\SanitizedSpace{\SanitizeStop}\SanitizeSpaces{}#2\SanitizeStop}%
}

% Sanitize tokens up to the next group, then go back to doing spaces.
\def\SanitizeGroups#1#{%
 \SanitizeTokens#1\SanitizeStop
 \EnterGroup
}

% Sanitize the next group from the top.
\newcommand\EnterGroup[1]{%
 \ifx\SanitizeStop#1%
  \expandafter\@gobble
 \else
  \expandafter\@firstofone
 \fi
 {\SanitizeSpaces{}#1 \SanitizeStop\SanitizeGroups}%
}

\newcommand\SanitizeTokens[1]{%
 \ifx\SanitizeStop#1%
 \else
  \ifx\SanitizedSpace#1%
   \RealSpace
  \else
   \ifx\ #1%
    \RealSpace
   \else
    \if\relax\noexpand#1%
    \else
     #1%
    \fi
   \fi
  \fi
  \expandafter\SanitizeTokens
 \fi
}

% We use TeX's proclivity to eat braces even for delimited arguments to eat the braces if #1 
% happens to be just {}, which we put in.
% Even if we didn't put it in, {} is going to get thrown out when \SanitizeSpaces gets to it.
\newcommand\IfEmpty[1]{%
 \IfOneTokenToStop.#1\SanitizeStop
  {% #1 has at most space tokens
   % and thus is nonempty if and only if there is a gap:
   \IfNoGapToStop.#1. \SanitizeStop
  }
  {% #1 has non-space tokens
   \@secondoftwo
  }%
}

% Checks for a gap in #1, meaning #2 is nonempty
% This should only be used with \IfEmpty
\def\IfNoGapToStop#1 #2\SanitizeStop{%
 % It's enough to check for one token, since #2 is never just spaces
 \IfOneTokenToStop.#2\SanitizeStop
}

\def\IfOneTokenToStop#1#2{% From \IfEmpty, #1 is always a .
 \ifx\SanitizeStop#2%
  % If #2 is multi-token, the rest of it will fall in the one-token case and be passed over.
  % If not, well, that's what we asked for.
  \expandafter\@firstoftwo
 \else
  \expandafter\GobbleToStopAndSecond
 \fi
}

\def\GobbleToStopAndSecond#1\SanitizeStop{%
 \@secondoftwo
}
\makeatother

\def\SanitizeStop{\SanitizeStop}
\def\SanitizedSpace{\SanitizedSpace}
\def\RealSpace{ }

\begin{document}
\setlength\parindent{0pt}\tt

% Torture test
\edef\a{%
 \Sanitize{ Word1 \macro{Word2 Word3}{\macro\ Word4}{ Word5} {Word6 }{}Word7{ }{{Word8}} }
}\meaning\a

\a
\medskip

% Examples
\edef\a{%
 \Sanitize{\emph{This} sentence has \TeX\ macros and {grouping}. }
}\meaning\a

\a
\medskip

\edef\a{%
 \Sanitize{{A}{ gratuitously {nested} sentence {}{{with many} layers}}.}
}\meaning\a

\a
\medskip

\end{document}