[Tex/LaTex] What are category codes

catcodes, tex-core

Following on from this question, I'd like to ask a more general question:

What are category codes, and what can I achieve by changing them?

Best Answer

When TeX parses input, it assigns a category code to each character it reads. How TeX subsequently interprets the input then depends on both the character and its category code. There are 16 category codes that can be set by the programmer, plus one special internal one. The 16 standard ones are numbered from 0 upward. Category code 0 is for escape characters, usually \. The rest are then (with typical examples):

1. Begin group: {
2. End group: }
3. Math shift: $
4. Alignment: &
5. End-of-line
6. Parameter for macros: #
7. Math superscript: ^
8. Math subscript: _
9. Ignored entirely
10. Space
11. Letters: the alphabet.
12. 'Other' character - everything else: ., 1, :, etc.
13. Active character - to be interpreted as control sequences: ~
14. Start-of-comment: %
15. Invalid-in-input: [DEL]
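To check which category code a character currently has, you can ask TeX with \the\catcode. A minimal plain TeX sketch follows; the values in the comments assume plain TeX's default settings, and other formats may differ.

```tex
% Plain TeX: write some current category codes to the terminal and log file.
\message{backslash: \the\catcode`\\}   % 0, escape
\message{left brace: \the\catcode`\{}  % 1, begin group
\message{dollar: \the\catcode`\$}      % 3, math shift
\message{letter a: \the\catcode`\a}    % 11, letter
\message{digit 1: \the\catcode`\1}     % 12, other
\message{tilde: \the\catcode`\~}       % 13, active
\bye
```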

Now when TeX reads input, each character is associated with a category code to generate tokens. So if the input reads

$ 1^{23}_a $

TeX reads:

A math shift token, and goes into math mode
A space, which is ignored in math mode
An 'other' token 1, which is simply typeset here
A math superscript token, thus meaning that the next item will be superscripted
A begin-group token {
The 'other' tokens 2 and 3, which cannot be typeset until the group finishes
The end-group token }, which allows TeX to typeset the superscript
A math subscript token, so moving the next item to a subscript position
The letter a, which with no special meaning is typeset
A space, again ignored
A math shift token, and goes back into horizontal mode
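What matters in this walkthrough is the category code, not the character itself. As a small illustration (a plain TeX sketch, with ! picked arbitrarily), giving ! category code 7 makes it behave exactly like ^ inside the group:

```tex
% Plain TeX sketch: give ! the category code of ^ (7, math superscript).
\begingroup
  \catcode`\!=7
  $x!2$   % ! now starts a superscript: typesets x squared
\endgroup
$x!2$     % ! is an 'other' character again: typesets x!2
\bye
```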

Category codes often become important when TeX is deciding on what is and is not a control sequence. With only the alphabet as 'letters', something like

\hello@

is the control sequence \hello followed by the 'other' token @. On the other hand, if I make @ a letter

\catcode`\@=11\relax
\hello@

then TeX will look for a macro called \hello@. This is commonly used in TeX code to isolate 'code' macros from 'user' ones, so you find programming macros such as \@for. Unless the category code of @ is changed, such a macro is effectively 'hidden' from document-level input. The idea is to 'protect the user from themselves': it's hard to break the code if you cannot even get at it!
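In LaTeX2e the usual way to do this is with \makeatletter and \makeatother, which set the category code of @ to 11 and back to 12. A minimal sketch, in which \my@greeting and \greeting are made-up names used only for illustration:

```tex
\documentclass{article}

\makeatletter            % @ now has category code 11 (letter)
\newcommand*\my@greeting{Hello}              % hypothetical internal macro
\newcommand*\greeting{\my@greeting, world!}  % hypothetical user-level wrapper
\makeatother             % @ is 'other' (12) again

\begin{document}
\greeting
% Writing \my@greeting here would be read as \my followed by the character @.
\end{document}
```

The wrapper still works in the document body because \my@greeting was tokenized as a single control sequence at definition time, while @ was a letter.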

There are many effects that can be achieved using category codes. An obvious one is the non-breaking space ~ used throughout the TeX world. This works because ~ has category code 13, and is therefore 'active'. When TeX reads ~, it looks for a definition for ~ in the same way it would for a macro. That's a lot more convenient than using a macro for these cases.
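Plain TeX sets ~ up with \catcode`\~=\active followed by \def~{\penalty\@M \ }. The same two steps work for any character. Here is a plain TeX sketch that makes | insert a thin space; the choice of | is only for illustration, and it would interfere with any later use of | as a math delimiter.

```tex
% Plain TeX sketch: make | an active character that inserts a thin space.
\catcode`\|=\active      % \active is plain TeX's name for 13
\def|{\thinspace}        % an active character is defined like a macro
10|000|km
\bye
```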

We can use different category codes to make 'private' code areas. For example, plain TeX and LaTeX2e use @ as an extra 'letter', whereas LaTeX3 uses : and _. That effectively isolates internal LaTeX3 code from LaTeX2e when the two are used together (as at present).
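For the LaTeX3 convention, the switch is made with \ExplSyntaxOn and \ExplSyntaxOff. A minimal sketch, assuming a LaTeX kernel recent enough to include expl3 (older ones need \usepackage{expl3}); \demo_greet:n and \Greet are hypothetical names:

```tex
\documentclass{article}
% Recent LaTeX kernels load expl3 by default; older ones need \usepackage{expl3}.

\ExplSyntaxOn            % : and _ become letters, spaces are ignored, ~ is a space
\cs_new:Npn \demo_greet:n #1 { Hello, ~ #1 ! }   % hypothetical internal function
\cs_new_eq:NN \Greet \demo_greet:n               % hypothetical user-level name
\ExplSyntaxOff

\begin{document}
\Greet{world}
\end{document}
```

Outside the \ExplSyntaxOn ... \ExplSyntaxOff region, \demo_greet:n cannot be typed directly, because _ and : are no longer letters there.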

Verbatim material is another area where category codes are vital (if complex!). The reason you can't nest verbatim material inside anything else is that, once TeX has assigned category codes, the assignment is only partially reversible. Anything which is 'ignored' or 'comment' is thrown away: you can't get it back. (With e-TeX, you can reassign category codes, but anything that is already gone stays 'lost'.)
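The e-TeX primitive that makes this reassignment possible is \scantokens, which re-reads a token list as if it were fresh input. A sketch in plain TeX syntax, assuming an engine with the e-TeX extensions enabled; \stored is a made-up name:

```tex
% Plain TeX syntax; needs an engine with the e-TeX extensions enabled.
\def\stored{a_b% this comment is discarded as the line is read
}
% \stored now holds three tokens: a, _ (category code 8), b.

\begingroup
  \tt              % cmtt10 has a visible underscore glyph
  \catcode`\_=12   % from here on, _ is read as an 'other' character
  \endlinechar=-1  % stop \scantokens appending an end-of-line
  \scantokens\expandafter{\stored}% re-reads "a_b"; the comment stays lost
\endgroup
\bye
```

The underscore gets a new category code on the second pass, but the comment text never made it into \stored, so no amount of re-reading brings it back.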

(Note for the interested) The 'special' category code is 16, which is used in the \ifcat test, amongst other things. It is assigned to unexpandable control sequences in this situation, so that they do not match anything else other than other unexpandable control sequences.

Related Solutions

[Tex/LaTex] What are TeX and LaTeX

TeX

TeX is a language (a full programming language, actually) for typesetting documents. It originally output to a format called DVI which could then be converted to PostScript, PDF, etc.; more recent versions can output directly to PDF. You write a document with TeX instructions in it, and the TeX system will convert it into printable material.

TeX is used for a wide variety of documents, particularly in science and academia. Most people use it for things that other people would likely use Word for; however, the quality of its results is more on a par with InDesign or other major document layout packages, and far superior to what word processors generally yield. Designing specialized or ad-hoc document formats such as brochures, however, is probably easier with InDesign or QuarkXPress (although it is not impossible to do so in TeX/LaTeX).

TeX itself is quite low-level.

LaTeX

LaTeX is a macro package written in and for TeX that provides commands and defaults for writing larger documents at a higher level, taking care of things like sectioning, tables of contents, etc. In my experience, most TeX users do not write low-level TeX directly, but rather use LaTeX. LaTeX is not the only such package, though; ConTeXt is another macro package with a different design philosophy, but it sits at a similar level to LaTeX.

Usage

TeX and LaTeX are very widespread in some portions of academia, such as mathematics and computer science, due to their superb support for mathematical formulas. I have also heard that they are popular in some other disciplines, such as linguistics.

[Tex/LaTex] How to print and assign category codes

There's really no category code 16. This code is assigned to control sequences (not \let to a character) only for the purposes of \ifcat.

Similarly, character code 256 is assigned to the same tokens for the purposes of \if.

You can get category codes 9 and 15 with

\the\catcode`\^^@
\the\catcode`\^^?

(these are the only characters that have those category codes in the default setting, bytes 0 and 127).

Notice also that \ifcat doesn't compare a token with a number, but two tokens:

\ifcat ab
\ifcat a1

The first evaluates to true, the second to false.
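To see the outcome of such tests without typesetting anything, the conditional can be wrapped in \message. A plain TeX sketch, assuming the default category codes (letters 11, digits and punctuation 12):

```tex
% Plain TeX: report the outcome of a few \ifcat tests on the terminal.
\message{\ifcat ab[ab: same]\else[ab: different]\fi}  % same (11 and 11)
\message{\ifcat a1[a1: same]\else[a1: different]\fi}  % different (11 and 12)
\message{\ifcat 1+[1+: same]\else[1+: different]\fi}  % same (12 and 12)
\bye
```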

Quoting from the TeXbook, page 209:

• \ifcat<token1><token2> (test if category codes agree)
TeX will expand macros following \if until two unexpandable tokens are found. If either token is a control sequence, TeX considers it to have character code 256 and category code 16, unless the current equivalent of that control sequence has been \let equal to a non-active character token.

The key word is considers. Look at page 38:

A token is either (a) a single character with an attached category code, or (b) a control sequence.

Section 506 of "TeX, the program" tells the truth about this: if the next token is a control sequence that has not been \let to a character (and is not a \noexpand-ed active character), then the character code used for the comparison is set to 256.
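The consequences for control sequences can be checked directly. In this plain TeX sketch, \asterisk is a made-up name for an implicit character token; the comments state what the rules quoted above predict.

```tex
% Plain TeX: control sequences in \if and \ifcat.
\let\asterisk=*   % hypothetical name; now an implicit character token (42, 12)
\message{\if\relax\par [if relax/par: true]\else [if relax/par: false]\fi}
                  % true: both are treated as character code 256
\message{\ifcat\relax\par [ifcat relax/par: true]\else [ifcat relax/par: false]\fi}
                  % true: both are treated as category code 16
\message{\if\asterisk*[if asterisk/*: true]\else [if asterisk/*: false]\fi}
                  % true: \asterisk carries the character code of *
\message{\if\relax*[if relax/*: true]\else [if relax/*: false]\fi}
                  % false: 256 against the character code of *
\bye
```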
