[Tex/LaTex] Which characters are technically legal in macro names with T1

font-encodings input-encodings macros tex-core

Disclaimer: I am aware that one should not use special characters in macro names and do not recommend doing that (on the contrary). I ask this question purely out of curiosity.

Until recently I thought that only "ordinary" characters could be used in macro names, i.e. letters (a–z, A–Z) and common symbols like digits (0–9) or punctuation (e.g. -, !). Following a question on this site I discovered that this is not true: even accented letters are allowed (and can be input directly when also using inputenc):

\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

\makeatletter

\begin{document}

\expandafter\def\csname \c c\v c\'e\endcsname{I am weird}
\csname \c c\v c\'e\endcsname:
\expandafter\string\csname \c c\v c\'e\endcsname

% with inputenc:
\def\äöü{Me too}
\äöü:
% \string\äöü does not work
\expandafter\string\csname äöü\endcsname

\end{document}

However, some characters, like \v o or \ss, give me errors. Input of ß (with inputenc), on the other hand, works just fine. Surprisingly, this did not work at all without the use of fontenc.

Can you give a precise rule for which characters are admissible in macro names?
Why is there a difference between writing \ss and writing ß?
Why is \v c legal but \v o not?
Why does \def\äöü work but \string\äöü not?
Why does this only work when using \usepackage[T1]{fontenc}?

Best Answer

Can you give a precise rule for which characters are admissible in macro names?

Absolutely all bytes 0 to 255 are admissible in macro names. But how convenient they are to type, and how they correspond to characters in the human-visible sense, can depend, among other things, on the catcodes and on the definitions of active characters, which in turn can depend on the packages currently loaded (the input encoding and the font encoding).

The precise rule is that a macro is either:

- A single active character: a token with 13 as the category code, and any number 0–255 as the character code.
- A control word: an escape character (\) followed by a sequence of letters (tokens with 11 as the category code, and any number 0–255 as the character code).
- A control symbol: an escape character (\) followed by a single non-letter (token with anything other than 11 as the category code, and any number 0–255 as the character code).
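
The three cases can be tried out in a small test document. This is only a sketch: the names \word and \1, and the choice of | as the active character, are made up for illustration and do not come from the question.

```latex
\documentclass{article}
\begin{document}
{% keep the catcode change local to this group
  \catcode`\|=13          % make | (character code 124) an active character
  \def|{active character} % 1. a macro that is a single active character
  \def\word{control word} % 2. escape character + letters (catcode 11)
  \def\1{control symbol}  % 3. escape character + one non-letter (a digit, catcode 12)
  |, \word, \1
}
\end{document}
```

Note that \def\12 would not define a macro named "12": the name ends after the first non-letter, and the 2 would become part of the parameter text.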

Before answering the rest of your questions, some explanation.

Like most software systems, TeX (specifically, non-Unicode TeX, i.e. Knuth TeX or pdfTeX, as opposed to XeTeX or LuaTeX) understands only bytes (0 to 255); it doesn't understand “characters” as such. (And like most pre-Unicode systems, its terminology sometimes uses “bytes” and “characters” in misleading ways.) To give the illusion of “understanding” bytes as characters, two “translations” happen:

- Font encoding: this says where the shapes (glyphs) for certain (what we think of as) characters are “supposed” to be in a font: e.g. under the default (OT1) encoding (and also under the T1 encoding), position 65 (octal '101, hexadecimal "41) is supposed to contain something that looks like an “A”. And position 231 (hexadecimal "E7) is supposed to contain a glyph for the “ç” in the T1 encoding, and not supposed to contain anything in the default (OT1) encoding. Correspondingly, the fontenc package redefines the meanings of \c etc as appropriate.
- Input encoding: With \usepackage[utf8]{inputenc}, this sets up certain characters (bytes) as active, so that UTF-8 sequences of bytes can be interpreted as the corresponding Unicode character.
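
One way to check the input-encoding part yourself is to ask TeX for the category code of one of the bytes involved. The following is a minimal sketch, assuming the document is compiled with pdfLaTeX (under XeLaTeX or LuaLaTeX, inputenc does not set things up this way):

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}
% Write the category code of byte "C3 (the first byte of UTF-8 ä, ö, ü)
% to the terminal and log; 13 means the byte is an active character.
\typeout{catcode of byte C3: \the\catcode"C3}
\end{document}
```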

Also: TeX has a way of directly inputting a specific byte in the input file, by ^^ followed by two hex digits (0123456789abcdef), e.g. anywhere you can type 'A' (in text, in a macro name, whatever), you can also type ^^41, etc. Let's use that for clarity.
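
As a small illustration of the ^^ notation with a plain ASCII byte (the macro name \mAcro is made up for this sketch):

```latex
\documentclass{article}
\begin{document}
\def\m^^41cro{same macro} % ^^41 is read as A, so this defines \mAcro
\mAcro                    % uses the macro defined on the previous line
^^41^^42^^43              % typesets ABC
\end{document}
```

The replacement happens while TeX reads the input line, before tokens are formed, which is why it also works in the middle of a macro name.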

With that understanding, the two examples in the question are:

\csname \c c\v c\'e\endcsname — here, with \usepackage[T1]{fontenc}, the definitions of \c, \v and \' are such that
- \c c expands to a token with category code 11 and character code 231 (hex e7),
- \v c expands to a token with category code 11 and character code 163 (hex a3),
- \' e expands to a token with category code 11 and character code 233 (hex e9).
So the following are equivalent:
```
\expandafter\def\csname \c c\v c\'e\endcsname{I am weird}
```
and
```
{\catcode"E7=11 \catcode"A3=11 \catcode"E9=11
 \expandafter\def\csname ^^e7^^a3^^e9\endcsname{I am weird}}
```
and simply
```
{\catcode"E7=11 \catcode"A3=11 \catcode"E9=11
\def\^^e7^^a3^^e9{I am weird}}
```
This is a macro of the “control word” type: a backslash followed by a sequence of three letters.
In \def\äöü{Me too}, äöü in the input file is (assuming you've saved the file in the UTF-8 encoding) the sequence of bytes C3 A4 C3 B6 C3 BC. Further, \usepackage[utf8]{inputenc} changes the catcodes of all these bytes to active. So the following two are equivalent:
```
% Assuming UTF-8 inputenc
\def\äöü{Me too}
```
and
```
{\catcode"C3=13 \catcode"A4=13 \catcode"B6=13 \catcode"BC=13 % Same as those set by \usepackage[utf8]{inputenc}
\def\^^c3^^a4^^c3^^b6^^c3^^bc{Me too}}
```
This is a macro of the “control symbol” type: what it has actually defined is \^^c3 (a single nonletter), with the requirement that when used it's supposed to be followed by the tokens ^^a4^^c3^^b6^^c3^^bc all of catcode 13. (Else you'll get something like Use of \^^c3 does not match its definition.)
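
The same mechanism can be seen with plain ASCII characters, which may be easier to experiment with. In this sketch the control symbol \1 and the delimiter text ab are arbitrary stand-ins for \^^c3 and the bytes that follow it:

```latex
\documentclass{article}
\begin{document}
\def\1ab{Me too} % defines \1; "ab" is delimiter text that must follow every use
\1ab             % matches the definition and expands to "Me too"
% \1xy           % would fail: Use of \1 doesn't match its definition.
\end{document}
```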

Now to answer the rest of your questions:

Why is \v c legal but \v o not?

\v c expands to the token with category code 11 (letter) and character code 163 (hex "A3). This you can see is the character č in T1.
\v o does not expand to a single character token (there is a č but no ǒ in the T1 encoding), but to instructions to add an appropriate accent to the o character. Inside \csname ... \endcsname, everything should expand to just character tokens.

Why is there a difference between writing \ss and writing ß?

There's not much of a difference really; just that you (I guess) tried the former inside \csname … \endcsname, and the latter directly after \def.

Unlike the earlier case where (for example) \c c expands to a single token with category code 11 and character code 231, \ss expands to \char"FF — that is, the TeX primitive command \char, followed by (if \char is being processed) the number "FF. (This is different from the token ^^ff, though why fontenc doesn't define \ss to expand to a single character token I don't know.) This too is not allowed inside \csname … \endcsname.
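
The restriction on \csname … \endcsname can be reproduced without any accents. A minimal sketch, with \suffix and \xabc as made-up names:

```latex
\documentclass{article}
\begin{document}
\def\suffix{abc} % expands to character tokens only
\expandafter\def\csname x\suffix\endcsname{fine} % defines \xabc
\xabc
% \csname x\char"41 \endcsname % would fail: Missing \endcsname inserted.
\end{document}
```

The commented-out line fails because \char is an unexpandable primitive, so TeX meets a non-character token before reaching \endcsname.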

ß too expands to something similar (you can't use it inside \csname … \endcsname either), but if you're using it after \def directly, then without expansion it's a sequence of two active characters ^^c3^^9f, and \def doesn't expand the tokens.

Why does \def\äöü work but \string\äöü not?

See above for why \def\äöü works: it's \def\^^c3^^a4^^c3^^b6^^c3^^bc.

And \string\äöü is \string\^^c3^^a4^^c3^^b6^^c3^^bc which is \string\^^c3 (which works: try it) followed by ^^a4^^c3^^b6^^c3^^bc (and the first byte there, the second byte of the UTF-8 representation of ä, has been defined as an active character that throws an error, because it should never appear on its own in valid UTF-8).
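
An ASCII analogue shows how \string splits the input, again using the made-up control symbol \1 with delimiter text ab:

```latex
\documentclass{article}
\begin{document}
\def\1ab{Me too}     % defines the control symbol \1, delimited by "ab"
\texttt{\string\1ab} % \string applies to \1 only; a and b are read as ordinary characters
\end{document}
```

Here the leftover a and b are harmless letters. In the UTF-8 case the leftover bytes are active characters, and that is where the error comes from.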

Why does this only work when using \usepackage[T1]{fontenc}?

The definition of the control symbol, as in \def\äöü{Me too}, will work with or without \usepackage[T1]{fontenc}, so will its usage. But if you want to use these “special” characters inside \csname ... \endcsname, then you need their definitions to be things that expand to just character tokens (which \usepackage[T1]{fontenc} does, because it can: those characters exist in the font), rather than expand to instructions for placing accents above/below other characters (which is what happens without \usepackage[T1]{fontenc}, as there's no alternative).
