See this paper (I think I learned of it from this answer):
Abstract:
While the quality of the results of TeX's mathematical formula layout algorithm is convincing, its original description is hard to understand since it is presented as an imperative program with complex control flow and destructive manipulations of the data structures representing formulae. In this paper, we present a re-implementation of TeX's formula layout algorithm in the functional language SML, thereby providing a more readable description of the algorithm, extracted from the monolithic TeX system.
Review (not sure where this is from or who wrote this, but I think it may be a feature of that journal?):
The TeX typesetting system basically consists of two parts: a macro language and certain basic typesetting capabilities, such as placing boxes next to each other, either vertically or horizontally. However, the real documentation for the typesetting capabilities is the TeXbook and the TeX source code.
This paper is truly welcome. The basic typesetting functions used by TeX should become clearer and should be reachable from a decent declarative language, thereby allowing documents to be typeset or generated in a declarative manner.
As ideas are better and better understood, they are expressed in a more and more declarative manner. The work done in this paper shows how this can be done for a very useful algorithm written in a very unreadable form. This sort of work should be done for other algorithms that are commonly used.
Work of the kind done in this paper allows one to perceive the need for more general means for laying out two-dimensional diagrams, with aesthetic or other constraints defining the actual placement of atomic components to form more complex diagrams.
In the paper, after explaining why they find the TeX program source code hard to read, they extract and reimplement the relevant part:
The main task of the formula layout algorithm consists in translating formula terms (a kind of abstract syntax for formulae) into box terms which describe the sizes and relative positions of the formula constituents in the final layout.
And this is what is implemented. After reading the paper, you can find their full code (2312 lines across 58 `.sml` files) at the location described in their paper, and also mirrored elsewhere. Note that (as the paper says) it does not implement everything in TeX: even square roots are not implemented.
Otherwise you can look at more recent (and still maintained) programs that implement just the math part of TeX, like MathJax (source), KaTeX (source), iosMath, or its port AndroidMath.
Best Answer
Parsing TeX is Turing complete
TeX can only be parsed by a complete Turing machine (modulo the finite space available), which precludes it from having a BNF. This comes from a combination of two features: first, TeX is Turing complete (if you need proof, this Turing machine simulator should suffice); and second, TeX can redefine macros (and their parsing rules) at runtime. Since TeX can require that macros be followed by specific characters, redefining a macro can mean redefining the syntax of TeX. Combining these facts, we can write TeX code in which `\f` is defined to be the algorithmic representation in TeX of some arbitrary (computable) function f : ℤ → ℤ, and `\Required` is then defined, depending on the value of `\f{0}`, either to require or to forbid a following `!`. Will `\Required!` parse? That depends on the value of f(0)/`\f{0}`! And since TeX is Turing-complete, computing this value may require a full Turing machine, and therefore so will determining whether or not `\Required!` is valid TeX syntax.

It gets worse: category codes
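The `\f`/`\Required` construction above can be made concrete with a hedged sketch; the particular definitions of `\Required` are illustrative assumptions, as is the requirement that `\f{n}` expand fully to a decimal numeral inside `\ifnum`:

```tex
% Assumption: \f{n} is expandable and produces the decimal value of f(n).
\ifnum\f{0}=0
  % If f(0) = 0, \Required must be followed by `!' ...
  \def\Required!{required}%
\else
  % ... otherwise it must be followed by `?'.
  \def\Required?{required}%
\fi
\Required!
```

The last line is syntactically valid exactly when f(0) = 0, so deciding whether this is well-formed TeX means evaluating f.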
Perhaps, though, you think that you can still perform a rough check for code that looks like a backslash, followed by letters, followed by braced stuff. Even that quickly becomes inadequate, and delimited macros are not the worst of it. Things get stranger still when you realize that the characters `\`, `{`, `}`, `$`, and so on behave the way they do only because TeX assigns them particular "category codes". These, too, can be reassigned at runtime, so a few lines of TeX changing catcodes force all subsequent (La)TeX code to be written in an entirely different surface syntax.
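As an illustrative sketch of such a catcode change (the choice of `|` as the new escape character is my own, not from the original):

```tex
% Category code 0 = escape character, 12 = "other" character.
\catcode`\|=0   % make | behave the way \ normally does
\catcode`\\=12  % demote \ to an ordinary printable character
% From here on, control sequences must be written with | instead of \ :
|def|greet{Hello}
|greet
|bye
```

After the two `\catcode` assignments, `\greet` would no longer be a control sequence at all; any tool that merely scans for backslashes would misparse everything that follows.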
Constructive Proof that TeX is (at least) Context-Sensitive
Additionally, inspired by TH.'s answer, here's a specific constructive proof that TeX is not context-free. And note that it doesn't have to rely on catcode trickery—just TeX's ability to define arbitrarily-delimited macros. (Yes, this is implied by the first section, but it's still nice to see a specific case in action.)
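Since the walkthrough below refers to a code listing, here is a reconstructed sketch consistent with that description; the macro names come from the text, but the exact definitions are my assumption and may differ from the original:

```tex
\newcount\nesting  % how many copies we have seen (for output only)

% \a, \b, and \c are never executed; they are only matched as delimiters.
\def\RecognizeABC#1{%
  \nesting=0
  \OneStep#1\relax
  Recognized \the\nesting\space copies each of
  \string\a, \string\b, and \string\c.%
}

% Strip one \a, one \b, and one \c per step; #1, #2, #3 are whatever
% sits between those delimiters.
\def\OneStep\a#1\b#2\c#3\relax{%
  \advance\nesting by 1
  \CheckFinished#1#2#3\relax}

% Peek at the next token: if it is \relax, all three arguments were
% empty and we are done, so \OneStep becomes "swallow the final \relax".
\def\CheckFinished{\futurelet\next\CheckFinishedAux}
\def\CheckFinishedAux{%
  \ifx\next\relax
    \def\OneStep\relax{}%
  \fi
  \OneStep}
```

In this sketch, `\RecognizeABC{\a\a\b\b\c\c}` should typeset a report of 2 copies each, while unbalanced input makes `\OneStep`'s delimited argument matching fail with a TeX error.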
This code is set up to recognize what Wikipedia calls "the canonical non-context-free language": n copies of `\a`, followed by n copies of `\b`, followed by n copies of `\c`, where n is at least one. The first line just allocates a counter; we'll use this to keep track of how many copies of the control sequences we see, but that's only so we can display some text at the end, not for parsing. To attempt recognition of some text, call `\RecognizeABC{...}`. This simply sets `\nesting` to zero (again, for output purposes) and calls `\OneStep`.

`\OneStep` is set up to take three arguments delimited by `\a`, `\b`, and `\c`: the first argument must come between `\a` and `\b`, the second between `\b` and `\c`, and the third between `\c` and `\relax`. If there are the same number of each of `\a`, `\b`, and `\c`, then the recursion will eventually reach `\OneStep\a\b\c\relax`, where all three arguments are empty. Within `\OneStep`, we record the fact that there's one more of each `\a`, `\b`, and `\c` (which is, again, just for output), and then we see if we're done.

`\CheckFinished` looks at the next token in the input stream. If it's `\relax`, then the arguments were empty and we're done, so we redefine `\OneStep` not to do anything. We then call `\OneStep` (whether it's the parsing macro or the do-nothing version) on the remaining input stream, recursing. Once the recursion terminates, we finally finish `\RecognizeABC`, where we typeset how many copies of each control sequence we saw.

Phew! If you didn't follow all that, the short version is: each `\OneStep` call strips one `\a`, one `\b`, and one `\c` and recurses, until `\CheckFinished` sees `\relax` and stops.
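Assuming each step re-inserts the three (now shorter) arguments ahead of a trailing `\relax`, a run on two copies of each control sequence proceeds roughly as:

```
\RecognizeABC{\a\a\b\b\c\c}
-> \OneStep\a\a\b\b\c\c\relax   (nesting := 1; strip one \a, \b, \c)
-> \CheckFinished\a\b\c\relax   (next token \a: not finished)
-> \OneStep\a\b\c\relax         (nesting := 2; all arguments empty)
-> \CheckFinished\relax         (next token \relax: finished)
-> \OneStep\relax               (now a no-op; typeset the count)
```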
Thus, TeX with equal numbers of each control sequence, such as `\RecognizeABC{\a\a\b\b\c\c}`, is accepted and typesets how many copies it saw; but unbalanced TeX such as `\RecognizeABC{\a\a\b\c\c}` fails with a TeX error.
Edit 1: In addition to minor bugfixes/cleanup and breaking the answer into sections with headings, I expanded the rationale for TeX being unparsable by anything short of a Turing machine, hopefully making it slightly clearer. This involved removing the previous `\weird` example and replacing it with something that's more of a proof. If you'd rather have a concrete example (but non-proof) with a command that redefines itself, consider the `\weird` macro that was there before. To start, `\weird` must be called as `\weird[stuff][other stuff]`. But if its first argument is `\relax`, then it redefines itself to require being called as `\weird!`; if its second argument is nothing, it redefines itself to require being called as `\weird(stuff)`; and it always typesets text. This can't be captured by a BNF grammar (I don't think). And of course, in the general case, we need a Turing machine.
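A hedged reconstruction of such a self-redefining `\weird`, consistent with the description (the exact bodies and the emptiness test are my assumptions):

```tex
\def\justrelax{\relax}
\def\emptyarg{}
\def\weird[#1][#2]{%
  \def\firstarg{#1}\def\secondarg{#2}%
  \ifx\firstarg\justrelax
    % First argument was \relax: henceforth it must be called as \weird!
    \def\weird!{weird}%
  \else\ifx\secondarg\emptyarg
    % Second argument was empty: henceforth \weird(stuff)
    \def\weird(##1){weird (##1)}%
  \fi\fi
  weird % it always typesets some text
}
```

In this sketch, `\weird[\relax][x]` typesets text and then demands `\weird!` on the next call, while `\weird[x][]` demands `\weird(...)` instead; the calling convention of the macro changes as the document is read.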