[Tex/LaTex] BNF grammar of the TeX language

grammarparsingtex-core

I'm looking for a BNF grammar of the TeX language, does it exist?

EDIT

For those of us who are not computer scientists, a BNF grammar is one kind of formal description of a CFG: Backus Naur Form.

For those who don't know their Chomsky, a CFG is a context-free grammar, which means (very roughly) that substitutions can be made independently of their use context.

Best Answer

Parsing TeX is Turing complete

TeX can only be parsed by a complete Turing machine (modulo the finite space available), which precludes it from having a BNF. This comes from a combination of two features: first, TeX is Turing complete (if you need proof, this Turing machine simulator should suffice); and second, TeX can redefine macros (and their parsing rules) at runtime. Since TeX can require that macros be followed by specific characters, redefining a macro can mean redefining the syntax of TeX. Combining these facts means that we can write TeX code like the following, where \f is defined to be the algorithmic representation in TeX of some arbitrary (computable) function f : ℤ → ℤ:

\def\TuringCompleteness#1#2{%
  \edef\Output{#1{#2}}%
  \ifnum\Output=0
    \def\Required!{Bang!}%
  \else
    \def\Required?{Huh?}%
  \fi}
\TuringCompleteness{\f}{0}
\Required!

Will this TeX parse? That depends on the value of f(0) / \f{0}! And since TeX is Turing-complete, computing this value may require a full Turing machine, and therefore so will determining whether or not \Required! is valid TeX syntax.


It gets worse: category codes

Perhaps, though, you think that you can still perform a rough check for code that looks like a backslash, followed by letters, followed by braced stuff. This will already get inadequate quickly, but that's not all. If delimited macros weren't bad enough, things get even worse when you realize that the reason that the characters \, {, }, $, and so on have the behavior they do is simply because TeX specifies that they have particular "category codes". The ability to redefine these means that writing TeX like

\catcode`\~=0
\catcode`\]=1
\catcode`\[=2
\catcode`\}=12
\catcode`\{=12
\catcode`\\=12

means that all subsequent (La)TeX code must be written like so:

~textbf]This is bold-faced text.[  This text contains a literal backslash
(~texttt]\[) and literal curly braces (~texttt]{}[).

to get output like so:

This is bold-faced text. This text contains a literal backslash (\) and literal curly braces ({}).


Constructive Proof that TeX is (at least) Context-Sensitive

Additionally, inspired by TH.'s answer, here's a specific constructive proof that TeX is not context-free. And note that it doesn't have to rely on catcode trickery—just TeX's ability to define arbitrarily-delimited macros. (Yes, this is implied by the first section, but it's still nice to see a specific case in action.)

\newcount\nesting

\def\RecognizeABC#1{\begingroup
  \nesting=0
  \OneStep#1\relax
  \ifnum\nesting=1
    There was \the\nesting\ copy of each control sequence.\par
  \else
    There were \the\nesting\ copies of each control sequence.\par
  \fi
\endgroup}

\def\CheckFinished#1{%
  \ifx#1\relax
    \def\OneStep#1{\relax}%
  \fi
  \OneStep#1}

\def\OneStep\a#1\b#2\c#3\relax{%
  \advance\nesting by 1
  \CheckFinished#1#2#3\relax}

This code is set up to recognize what Wikipedia calls "the canonical non-context free language" of n copies of \a followed by n copies of \b followed by n copies of \c, where n is at least one. The first line just allocates a counter; we'll use this to keep track of how many copies of the control sequences we see, but it's just so we can display some text at the end, not for parsing. To attempt recognition of some text, call \RecognizeABC{...}. This simply sets \nesting to zero (for, again, output purposes), and calls \OneStep. \OneStep is set up to take three arguments delimited by \a, \b, and \c. The first argument must come between \a and \b, the second between \b and \c, and the third between \c and \relax. If there are the same number of each of \a, \b, and \c, then this will eventually be \OneStep\a\b\c\relax and all the arguments will be empty. With \OneStep, we just record the fact that there's one more of each \a, \b, and \c (which is, again, just for output), and then we see if we're done. \CheckFinished looks to see what the next token in the input stream is. If it's \relax, then the arguments were empty, and we're done; redefine \OneStep not to do anything. Now, call \OneStep—whether it's the parsing macro or nothing at all—on the remaining input stream, recursing. Once we're done, we finally finish \RecognizeABC, where we typeset how many copies of each control sequence we saw.

Phew! If you didn't follow all that, here's what a trace looks like, if we only look at the important macros:

[START] \RecognizeABC{\a\a\a\b\b\b\c\c\c}
    --> \OneStep\a\a\a\b\b\b\c\c\c\relax
                  -#1-  -#2-  -#3-
    --> \CheckFinished\a\a\b\b\c\c\relax
    --> \OneStep\a\a\b\b\c\c\relax
                  #1  #2  #3
    --> \CheckFinished\a\b\c\relax
    --> \OneStep\a\b\c\relax
          (All the arguments are empty.)
    --> \CheckFinished\relax
    --> ``There were 3 copies of each control sequence.''

Thus, the following TeX

\RecognizeABC{\a\b\c}
\RecognizeABC{\a\a\b\b\c\c}
\RecognizeABC{\a\a\a\b\b\b\c\c\c}

Yields the following output:

There was 1 copy of each control sequence.
There were 2 copies of each control sequence.
There were 3 copies of each control sequence.

But TeX such as

\RecognizeABC{}
\RecognizeABC{\a}
\RecognizeABC{\a\a\b\b\c\c\c}
\RecognizeABC{\a\b\c\garbage}

fails.


Edit 1: In addition to minor bugfixes/cleanup and breaking the answer into sections with headings, I expanded the rationale for TeX being unparsable by anything short of a Turing machine to hopefully make it slightly more clear. This involved removing the previous \weird example, replacing it with something that's more of a proof. If you'd rather have a concrete example (but non-proof) with a command which redefines itself, consider the following TeX, which was there before:

\def\weird[#1][#2]{%
  \ifx#1\relax
    \def\weird!{I was previously called with a relaxing first argument.}%
  \else\ifx#2\relax
    \def\weird(##1){I can only be called with one argument; I was given (##1).}%
  \fi\fi
  I was called with [#1] and [#2].}

To start, \weird must be called as \weird[stuff][other stuff]. But if its first argument is \relax, then it redefines itself to require being called as \weird!; if its second argument is nothing, it redefines itself to require being called as \weird(stuff); and it always typesets text. This can't be captured by a BNF grammar (I don't think). And of course, in the general case, we need a Turing machine.

Related Question