Parsing TeX is Turing complete
TeX can only be parsed by a complete Turing machine (modulo the finite space available), which precludes it from having a BNF. This comes from a combination of two features: first, TeX is Turing complete (if you need proof, this Turing machine simulator should suffice); and second, TeX can redefine macros (and their parsing rules) at runtime. Since TeX can require that macros be followed by specific characters, redefining a macro can mean redefining the syntax of TeX. Combining these facts means that we can write TeX code like the following, where \f
is defined to be the algorithmic representation in TeX of some arbitrary (computable) function f : ℤ → ℤ:
\def\TuringCompleteness#1#2{%
\edef\Output{#1{#2}}%
\ifnum\Output=0
\def\Required!{Bang!}%
\else
\def\Required?{Huh?}%
\fi}
\TuringCompleteness{\f}{0}
\Required!
Will this TeX parse? That depends on the value of f(0) / \f{0}
! And since TeX is Turing-complete, computing this value may require a full Turing machine, and therefore so will determining whether or not \Required!
is valid TeX syntax.
It gets worse: category codes
Perhaps, though, you think that you can still perform a rough check for code that looks like a backslash, followed by letters, followed by braced stuff. This will already get inadequate quickly, but that's not all. If delimited macros weren't bad enough, things get even worse when you realize that the reason that the characters \
, {
, }
, $
, and so on have the behavior they do is simply because TeX specifies that they have particular "category codes". The ability to redefine these means that writing TeX like
\catcode`\~=0
\catcode`\]=1
\catcode`\[=2
\catcode`\}=12
\catcode`\{=12
\catcode`\\=12
means that all subsequent (La)TeX code must be written like so:
~textbf]This is bold-faced text.[ This text contains a literal backslash
(~texttt]\[) and literal curly braces (~texttt]{}[).
to get output like so:
This is bold-faced text. This text contains a literal backslash (\
) and literal curly braces ({}
).
Constructive Proof that TeX is (at least) Context-Sensitive
Additionally, inspired by TH.'s answer, here's a specific constructive proof that TeX is not context-free. And note that it doesn't have to rely on catcode trickery—just TeX's ability to define arbitrarily-delimited macros. (Yes, this is implied by the first section, but it's still nice to see a specific case in action.)
\newcount\nesting
\def\RecognizeABC#1{\begingroup
\nesting=0
\OneStep#1\relax
\ifnum\nesting=1
There was \the\nesting\ copy of each control sequence.\par
\else
There were \the\nesting\ copies of each control sequence.\par
\fi
\endgroup}
\def\CheckFinished#1{%
\ifx#1\relax
\def\OneStep#1{\relax}%
\fi
\OneStep#1}
\def\OneStep\a#1\b#2\c#3\relax{%
\advance\nesting by 1
\CheckFinished#1#2#3\relax}
This code is set up to recognize what Wikipedia calls "the canonical non-context free language" of n copies of \a
followed by n copies of \b
followed by n copies of \c
, where n is at least one. The first line just allocates a counter; we'll use this to keep track of how many copies of the control sequences we see, but it's just so we can display some text at the end, not for parsing. To attempt recognition of some text, call \RecognizeABC{...}
. This simply sets \nesting
to zero (for, again, output purposes), and calls \OneStep
. \OneStep
is set up to take three arguments delimited by \a
, \b
, and \c
. The first argument must come between \a
and \b
, the second between \b
and \c
, and the third between \c
and \relax
. If there are the same number of each of \a
, \b
, and \c
, then this will eventually be \OneStep\a\b\c\relax
and all the arguments will be empty. With \OneStep
, we just record the fact that there's one more of each \a
, \b
, and \c
(which is, again, just for output), and then we see if we're done. \CheckFinished
looks to see what the next token in the input stream is. If it's \relax
, then the arguments were empty, and we're done; redefine \OneStep
not to do anything. Now, call \OneStep
—whether it's the parsing macro or nothing at all—on the remaining input stream, recursing. Once we're done, we finally finish \RecognizeABC
, where we typeset how many copies of each control sequence we saw.
Phew! If you didn't follow all that, here's what a trace looks like, if we only look at the important macros:
[START] \RecognizeABC{\a\a\a\b\b\b\c\c\c}
--> \OneStep\a\a\a\b\b\b\c\c\c\relax
-#1- -#2- -#3-
--> \CheckFinished\a\a\b\b\c\c\relax
--> \OneStep\a\a\b\b\c\c\relax
#1 #2 #3
--> \CheckFinished\a\b\c\relax
--> \OneStep\a\b\c\relax
(All the arguments are empty.)
--> \CheckFinished\relax
--> ``There were 3 copies of each control sequence.''
Thus, the following TeX
\RecognizeABC{\a\b\c}
\RecognizeABC{\a\a\b\b\c\c}
\RecognizeABC{\a\a\a\b\b\b\c\c\c}
Yields the following output:
There was 1 copy of each control sequence.
There were 2 copies of each control sequence.
There were 3 copies of each control sequence.
But TeX such as
\RecognizeABC{}
\RecognizeABC{\a}
\RecognizeABC{\a\a\b\b\c\c\c}
\RecognizeABC{\a\b\c\garbage}
fails.
Edit 1: In addition to minor bugfixes/cleanup and breaking the answer into sections with headings, I expanded the rationale for TeX being unparsable by anything short of a Turing machine to hopefully make it slightly more clear. This involved removing the previous \weird
example, replacing it with something that's more of a proof. If you'd rather have a concrete example (but non-proof) with a command which redefines itself, consider the following TeX, which was there before:
\def\weird[#1][#2]{%
\ifx#1\relax
\def\weird!{I was previously called with a relaxing first argument.}%
\else\ifx#2\relax
\def\weird(##1){I can only be called with one argument; I was given (##1).}%
\fi\fi
I was called with [#1] and [#2].}
To start, \weird
must be called as \weird[stuff][other stuff]
. But if its first argument is \relax
, then it redefines itself to require being called as \weird!
; if its second argument is nothing, it redefines itself to require being called as \weird(stuff)
; and it always typesets text. This can't be captured by a BNF grammar (I don't think). And of course, in the general case, we need a Turing machine.
First, Mike's answer is quite good. I will mostly expand on it and provide more details.
TeX
TeX is a language (a full programming language, actually) for typesetting documents. It originally output to a format called DVI which could then be converted to PostScript, PDF, etc.; more recent versions can output directly to PDF. You write a document with TeX instructions in it, and the TeX system will convert it into printable material.
TeX is used for a wide variety of documents, particularly in science and academia. Most people use it for things that other people would likely use Word for; however, the quality of its results are more on a par with InDesign or other major document layout packages, far superior what word processors generally yield. Designing specialized or ad-hoc document formats such as brochures, however, is probably easier with InDesign or QuarkXPress (although it is not impossible to do so in TeX/LaTeX).
TeX itself is quite low-level.
LaTeX
LaTeX is a macro package written in and for TeX that provides commands and defaults for writing larger documents at a higher level, taking care of things like sectioning, tables of contents, etc. In my experience, most TeX users do not write low-level TeX directly, but rather use LaTeX. LaTeX is not the only such package, though; ConTeXt is another macro package with a different design philosophy, but it sits at a similar level to LaTeX.
Usage
TeX and LaTeX are very widespread in some portions of academia, such as mathematics and computer science, due to its superb support for mathematical formulas. I have also heard that it is popular in some other disciplines as well, such as linguistics.
Best Answer
I don't know whether these are open problems or not, but since you are looking for a capstone project, you might be interested to explore if the basic algorithmic aspects of TeX can be improved.
Line-breaking algorithm. The current line breaking algorithm is a gold standard that all line-breaking software emulate, but is it the best way to break lines? Knuth and Plass's algorithm made specific premature optimization choices (pun intended!) like separating page-break from line break, assigning badness based on the raggedness of line but not accounting for rivers, etc. The only real advance since then has been character protrusion, and from what I understand, it still follows the same basic line-breaking algorithm. Now don't get me wrong. I am not saying that these choices were wrong. But a lot of these choices were made because the computational resources of that time could not really handle anything more sophisticated. But now that we have computers that are 1000 times more powerful than those in the 70s, it should be possible to explore other options to see if the line breaking algorithm can be improved by taking into account more factors, especially page-breaking, footnotes, side-notes, and floats. What is better, perfect line breaks but huge vertical spaces to balance the page, or slightly underfull lines but no vertical spaces? There is no way to play around with these in the current framework (please correct me if I am wrong).
Automatic breaking of display equations. Currently the
breqn
package implements the ideas of Michael J Downes, but AFAIK, the algorithmic aspects are not as well understood as that of line-breaking of text. Is it possible to case line-breaking of display equations as an optimization problem and determine a solution based on penalties and badness?Parsing natural math. There are recurring questions asking if it is possible to automatically translate
<=
to\le
,sin(x/y)
to\sin\left(\frac{x}{y}\right)
, etc. Although it is possible to do so to a varying degree of success with TeX and LuaTeX (e.g., thecalcmath
module in ConTeXt), I haven't seen any work that tries to understand how to parse math without markup. Given how sophisticated the current NLP techniques are, it should be possible to do better than simple heuristics for parsing natural math.