[Tex/LaTex] What parsers for (La)TeX mathematics exist outside of the TeX engines?

math-mode mathjax parsing tools

Inspired by the author's motivation for asking "Is there a BNF grammar of tex language".

Are there any well done libraries that can parse some subset of TeX mathematics independently of the TeX engine? Important points to consider in answers:

  • How large a subset of TeX mathematical notation is supported?

  • Is the parser portable? Does it have any dependencies?

  • Is the parser closely tied to a particular backend, or could it easily be used to support multiple output formats? In other words, how easily could it be integrated into a new system that had to support output to PDF, HTML, PNG, etc.?

For example, I know of parsers designed for Matplotlib graphics and for math rendering in web browsers, but not much about their applicability outside those use cases.

Best Answer

I've been looking into this too, so I'll share some observations that fall rather short of a proper answer, which would really involve looking at a whole lot of source code and asking the right questions about it.

Parsers generating HTML+MathML

  1. Nikos Drakos & Ross Moore's Latex2html converter, written in Perl, which I think was the first converter to map equations to MathML. In 1998, Ross Moore outlined his goals for Latex2html, tied to the now defunct, closed-source WebEq mathematics rendering software, and to Webtex, an alternative syntax for mathematics designed for use in web pages. From the WebEq documentation: "WebTeX always translates unambiguously into MathML, while LaTeX does not."
  2. itex2mml, in C by Paul Gartside & others, also based on Webtex, but with support for some Latex not supported in Webtex.
  3. tex4ht, written in C by Eitan Gurari and other eminent figures. It avoids having to parse Latex source by running latex with modified macros that insert specials into the DVI output, and parses the DVI output instead.
  4. John MacFarlane's Pandoc, as mentioned by Aditya, written in Haskell. Note that Pandoc supports generation of HTML both with and without MathML.
  5. MathJax can generate MathML besides its usual output of boxes plus image fonts. It has an impressive degree of support for Latex, including limited support for user macros.

Parsers generating XML

Jason Blevins has a list of tools that convert Latex documents to XML-based formats and that handle equations reasonably. Romeo Anghelache's Hermes, which is part of a full Latex parser that generates XML with semantic markup, is worth singling out: like tex4ht, it works by running the Tex engine with macros that put specials in the DVI output, which it then parses; it supports a wider set of semantic markup.

Fragments of Latex or DVI

With the exception of the systems referencing Webtex, there doesn't seem to be much interest in clearly codifying subsets of Latex to be parsed, I guess because these are regarded as moving targets. Instead, lists of supported commands, like the one I mentioned for Mathjax, seem to be the way things are done.
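To make "clearly codifying a subset" concrete, here is a minimal sketch in Python of what such a codified fragment could look like. The grammar in the comment is my own invention for illustration and is not the subset accepted by Webtex, Mathjax, or any other tool named here; the names `tokenize`, `Parser` and `parse` are likewise hypothetical.

```python
import re

# Hypothetical subset, chosen only for illustration:
#   expr ::= item*
#   item ::= atom (("^" | "_") atom)*
#   atom ::= CHAR | COMMAND | "{" expr "}" | "\frac" atom atom
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|[{}^_]|[^\s\\{}^_]")


def tokenize(src):
    """Split a formula into control sequences, braces, script marks and characters."""
    return TOKEN.findall(src)


class Parser:
    def __init__(self, tokens):
        self.toks = tokens
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def next(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    def expr(self, stop=None):
        items = []
        while self.peek() is not None and self.peek() != stop:
            items.append(self.item())
        return ("row", items)

    def item(self):
        node = self.atom()
        # Scripts nest to the left here; TeX itself attaches both scripts
        # to one nucleus, which this sketch does not model.
        while self.peek() in ("^", "_"):
            kind = "sup" if self.next() == "^" else "sub"
            node = (kind, node, self.atom())
        return node

    def atom(self):
        tok = self.next()
        if tok == "{":
            node = self.expr(stop="}")
            self.next()  # consume "}", or raise if the group is unclosed
            return node
        if tok in ("}", "^", "_"):
            raise SyntaxError(f"unexpected {tok!r}")
        if tok == r"\frac":
            return ("frac", self.atom(), self.atom())
        if tok.startswith("\\"):
            return ("cmd", tok[1:])
        return ("char", tok)


def parse(src):
    """Return a backend-neutral tree of nested tuples."""
    return Parser(tokenize(src)).expr()


print(parse(r"\frac{a}{b}^2"))
```

The tree is plain nested tuples, so a MathML, HTML or image backend could each walk it separately; anything outside the stated grammar (user macros, environments, `\left`/`\right`) is simply not covered.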

With DVI-based converters, the issue of parsing Latex goes away, replaced by the relatively trivial issue of parsing marked-up DVI and the trickier issue of identifying the semantically significant macros and constructing markup-issuing replacements that do not improperly interfere. I haven't looked at how this is done for equational layout. It would be a useful exercise to see how a converter from Tex formulae to those of Heckmann & Wilhelm (1997) would work.
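As a rough illustration of why parsing marked-up DVI is the easy part, here is a sketch in Python that walks the DVI opcodes and collects the payload of every `\special` (the `xxx1` to `xxx4` commands). It follows the standard DVI opcode layout but is not taken from tex4ht or Hermes; the function names and the payload `demo:begin-math` are made up for the example, and real converters define their own special conventions.

```python
def _build_param_table():
    """Map each fixed-length DVI opcode to the number of parameter bytes after it."""
    table = {op: 0 for op in range(128)}              # set_char_0 .. set_char_127
    table.update({op: 0 for op in range(171, 235)})   # fnt_num_0 .. fnt_num_63
    for op in (138, 140, 141, 142, 147, 152, 161, 166):
        table[op] = 0                                 # nop, eop, push, pop, w0, x0, y0, z0
    table[132] = table[137] = 8                       # set_rule, put_rule
    table[139] = 44                                   # bop: ten counters plus a pointer
    # Families taking a 1..4 byte argument: set, put, right, w, x, down, y, z, fnt
    for first in (128, 133, 143, 148, 153, 157, 162, 167, 235):
        for n in range(4):
            table[first + n] = n + 1
    return table


PARAM_BYTES = _build_param_table()


def extract_specials(dvi):
    """Return the text of every \\special found before the postamble."""
    specials = []
    pos = 0
    while pos < len(dvi):
        op = dvi[pos]
        pos += 1
        if 239 <= op <= 242:                          # xxx1 .. xxx4
            n = op - 238                              # width of the length field
            length = int.from_bytes(dvi[pos:pos + n], "big")
            pos += n
            specials.append(dvi[pos:pos + length].decode("latin-1"))
            pos += length
        elif 243 <= op <= 246:                        # fnt_def1 .. fnt_def4
            pos += (op - 242) + 12                    # font number, checksum, scale, design size
            area_len, name_len = dvi[pos], dvi[pos + 1]
            pos += 2 + area_len + name_len
        elif op == 247:                               # pre
            pos += 13                                 # version, num, den, mag
            pos += 1 + dvi[pos]                       # comment length and comment
        elif op == 248:                               # post: no more pages follow
            break
        elif op in PARAM_BYTES:
            pos += PARAM_BYTES[op]
        else:
            raise ValueError(f"unknown DVI opcode {op}")
    return specials


# A tiny hand-built DVI stream: preamble, one page with a special and a character.
payload = b"demo:begin-math"                          # hypothetical marker
demo = (bytes([247, 2]) + bytes(12) + bytes([0])      # pre with empty comment
        + bytes([139]) + bytes(44)                    # bop
        + bytes([239, len(payload)]) + payload        # xxx1 special
        + bytes([ord("x"), 140])                      # set_char, eop
        + bytes([248]))                               # post
print(extract_specials(demo))
```

On a real file one would read the bytes with `open(path, "rb").read()`. The harder work described above, deciding which macros to redefine and what their specials should say, happens on the Tex side and is not shown.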

Syntax highlighting

A completely different kind of parsing is involved in syntax highlighting, where the idea is to help the author see the significance of the parts of the formulae. I don't know of any syntax highlighters that do an interesting job here: Auctex only raises/lowers superscripts and subscripts, but I haven't really looked.

Reference

Heckmann & Wilhelm, 1997, A Functional Description of TeX's Formula Layout.