[TeX/LaTeX] What are category codes?

catcodes tex-core

Following on from this question, I'd like to ask a more general question:

What are category codes, and what can I achieve by changing them?

Best Answer

When TeX parses input, it assigns a category code to each character it reads. How TeX subsequently interprets the input then depends on both the character and its category code. There are 16 category codes that can be set by the programmer, plus one special internal one. The 16 standard ones are numbered from 0 to 15. Category code 0 is for escape characters, usually \. The rest are then (with typical examples):

  1. Begin group: {
  2. End group: }
  3. Math shift: $
  4. Alignment: &
  5. End-of-line
  6. Parameter for macros: #
  7. Math superscript: ^
  8. Math subscript: _
  9. Ignored entirely
  10. Space
  11. Letters: the alphabet.
  12. 'Other' character - everything else: ., 1, :, etc.
  13. Active character - to be interpreted as control sequences: ~
  14. Start-of-comment: %
  15. Invalid-in-input: [DEL]
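If you want to check these assignments yourself, the \catcode primitive can be read as well as set. The following is only a small sketch for plain TeX (run it with tex and look at the terminal or log); the values in the comments are what I would expect with the standard plain TeX settings, and a format or package that changes category codes would give different numbers.

```tex
% Plain TeX sketch: print the current category codes of a few characters.
\message{backslash=\the\catcode`\\}   % expected: 0  (escape)
\message{open brace=\the\catcode`\{}  % expected: 1  (begin group)
\message{letter a=\the\catcode`a}     % expected: 11 (letter)
\message{at sign=\the\catcode`\@}     % expected: 12 (other)
\message{percent=\the\catcode`\%}     % expected: 14 (comment)
\bye
```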

Now when TeX reads input, each character is associated with a category code to generate tokens. So if the input reads

$ 1^{23}_a $

TeX reads:

  • A math shift token, and goes into math mode
  • A space, which is ignored in math mode
  • An 'other' token 1, which is simply typeset here
  • A math superscript token, thus meaning that the next item will be superscripted
  • A begin-group token,
  • The 'other' tokens 2 and 3, which cannot be typeset until the group finishes
  • The close-group token }, which allows TeX to typeset the superscript
  • A math subscript token, so moving the next item to a subscript position
  • The letter a, which with no special meaning is typeset
  • A space, again ignored
  • A math shift token, and goes back into horizontal mode

Category codes often become important when TeX is deciding on what is and is not a control sequence. With only the alphabet as 'letters', something like

\hello@

is the control sequence \hello followed by the 'other' token @. On the other hand, if I make @ a letter

\catcode`\@=11\relax
\hello@

then TeX will look for a macro called \hello@. This is commonly used in TeX code to isolate 'code' macros from 'user' ones. So you find programming macros such as \@for. Without changing the category code, this is effectively 'hidden'. The idea of this is to 'protect the user from themselves': it's hard to break the code if you cannot even get at it!
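In LaTeX the usual way to do this is with \makeatletter and \makeatother, which set the category code of @ to 11 and back to 12. Here is a minimal sketch; \my@helper and \ShowHelper are names made up for the illustration, not macros from any package.

```tex
\documentclass{article}
\makeatletter                          % same effect as \catcode`\@=11
\newcommand*\my@helper{internal text}  % hypothetical 'code' macro
\newcommand*\ShowHelper{\my@helper}    % hypothetical 'user' macro
\makeatother                           % @ is 'other' (12) again
\begin{document}
\ShowHelper   % typesets: internal text
% Writing \my@helper here would be read as \my followed by @helper.
\end{document}
```

Note that \ShowHelper still works after \makeatother: the category codes matter when the definition is read, not when the macro is used.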

There are many effects that can be achieved using category codes. An obvious one is the non-breaking space ~ used throughout the TeX world. This works because ~ has category code 13, and is therefore 'active'. When TeX reads ~, it looks for a definition for ~ in the same way it would for a macro. That's a lot more convenient than using a macro for these cases.
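You can make your own active characters in the same way. As a rough sketch in plain TeX (the choice of | and its definition are just for illustration), this turns | into a spaced dash; plain TeX sets up ~ along much the same lines, with a definition based on \penalty and a control space.

```tex
% Plain TeX sketch: turn | into an active character.
\catcode`\|=\active              % \active is plain TeX's name for 13
\def|{\thinspace---\thinspace}   % illustrative definition only
one|two                          % the | is expanded like a macro
\bye
```

The \catcode line has to be executed before TeX reads the line containing \def|, otherwise the | there would still be an 'other' token and the definition would fail.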

We can use different category codes to make 'private' code areas. For example, plain TeX and LaTeX2e use @ as an extra 'letter', whereas LaTeX3 uses : and _. That effectively isolates internal LaTeX3 code from LaTeX2e, when the two are used together (as at present).
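For the LaTeX3 case, the switch is made with \ExplSyntaxOn and \ExplSyntaxOff. The sketch below assumes a reasonably recent LaTeX kernel, where expl3 and \NewDocumentCommand are built in (on older systems you would load the expl3 and xparse packages first); \mypkg_greet:n and \Greet are hypothetical names. Inside the expl3 block, spaces are ignored and ~ gives a normal space.

```tex
\documentclass{article}
\ExplSyntaxOn   % _ and : become letters, spaces are ignored
\cs_new:Npn \mypkg_greet:n #1 { Hello,~#1! }              % hypothetical internal function
\NewDocumentCommand \Greet { m } { \mypkg_greet:n {#1} }  % hypothetical user command
\ExplSyntaxOff  % back to the normal document category codes
\begin{document}
\Greet{world}   % typesets: Hello, world!
\end{document}
```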

Verbatim material is another area where category codes are vital (if complex!). The reason you can't nest verbatim material inside anything else is that once TeX has assigned category codes to the input, the assignment is only partially reversible. Anything which is 'ignored' or 'comment' is thrown away: you can't get it back. (With e-TeX, you can reassign category codes, but anything that is already gone stays 'lost'.)
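A small LaTeX sketch of the problem (\fbox is just one convenient example of a command that grabs its argument before \verb gets to run):

```tex
\documentclass{article}
\begin{document}
% Works: \verb changes the category codes itself and then reads
% the characters, so % and \ are treated as ordinary characters.
\verb|50% of \foo|

% Typically fails: \fbox reads its whole argument first, with the
% normal category codes, so the % starts a comment and the rest of
% the line (including the closing brace) is discarded.
% \fbox{\verb|50% of \foo|}
\end{document}
```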


(Note for the interested) The 'special' category code is 16, which is used in the \ifcat test, amongst other things. It is assigned to unexpandable control sequences in this situation, so that they do not match anything else other than other unexpandable control sequences.
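To see this in action, here is a short plain TeX sketch using \ifcat inside \message. The results in the comments are what I would expect with the standard plain TeX category codes.

```tex
% Plain TeX sketch: compare category codes with \ifcat.
\message{a vs b: \ifcat ab same\else different\fi}               % same (11 and 11)
\message{a vs 1: \ifcat a1 same\else different\fi}               % different (11 and 12)
\message{relax vs hbox: \ifcat\relax\hbox same\else different\fi} % same (both treated as 16)
\message{a vs relax: \ifcat a\relax same\else different\fi}      % different (11 and 16)
\bye
```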
