[Tex/LaTex] Generate a parse tree for TeX

parsing

If I google for "parse tree" and "TeX", all I get is instructions for drawing parse trees in TeX.

I want to know if there are programs that can parse TeX, and output a parse tree that's easy to manipulate.

Best Answer

TeX is not built along the classical compiler architecture of a scanner and a parser that builds a parse tree. (When TeX was invented these concepts were much less clear than they are today.) And since TeX does not use a parse tree internally, you are unlikely to find a tool that represents TeX code accurately in such a way. If you are serious about manipulating TeX code, I'd suggest you tell us what you want to achieve; maybe there is another way to do it. One tool I would consider looking at is HeVeA.
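One commonly cited reason why a static parse tree is hard to build is that TeX decides how to split characters into tokens while it executes, and category codes can change along the way. The following is a minimal sketch of that effect, not something taken from the answer above; the choice of "!" as a second escape character is purely illustrative.

```latex
\documentclass{article}
\begin{document}
!textbf{x} % "!" is an ordinary character here: prints "!textbfx"

\begingroup
\catcode`\!=0 % assign category code 0 (escape) to the exclamation mark
!textbf{x} % the same characters now form the control sequence \textbf
\endgroup
\end{document}
```

The two lines containing `!textbf{x}` are identical as text, yet they yield different token lists, so a tool that reads the source without executing it cannot know which reading is correct.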

Related Solutions

[Tex/LaTex] What parsers for (La)TeX mathematics exist outside of the TeX engines

I've been looking into this too, so I'll share some observations that fall rather short of a proper answer, which would really involve looking at a whole lot of source code and asking the right questions about it.

Parsers generating HTML+MathML

Nikos Drakos & Ross Moore's LaTeX2HTML converter, written in Perl, which I think was the first converter to map equations to MathML. In 1998, Ross Moore outlined his goals for LaTeX2HTML, tied to the now defunct, closed-source WebEq mathematics rendering software, and WebTeX, which was an alternative syntax for mathematics designed for use in web pages. From the WebEq documentation: WebTeX always translates unambiguously into MathML, while LaTeX does not.
itex2MML, written in C by Paul Gartside & others, also based on WebTeX, but with support for some LaTeX not supported in WebTeX.
tex4ht, written in C by Eitan Gurari and other eminent figures. It avoids having to parse LaTeX source by running latex with modified macros that insert specials into the DVI output, and parses the DVI output instead.
John MacFarlane's Pandoc, as mentioned by Aditya, written in Haskell. Note that Pandoc supports generation of HTML, both with and without MathML.
MathJax allows generation of MathML besides the usual boxes plus image fonts output. It has an impressive degree of support for LaTeX, including limited support for user macros.

Parsers generating XML

Jason Blevins has a list of tools that convert LaTeX documents to XML-based formats and that handle equations reasonably. Romeo Anghelache's Hermes, which is part of a full LaTeX parser that generates XML with semantic markup, is worth singling out: like tex4ht, it works by running the TeX engine with macros that put specials in the DVI output, which it then parses; it supports a wider set of semantic markup.

Fragments of Latex or DVI

With the exception of the systems referencing WebTeX, there doesn't seem to be much interest in clearly codifying subsets of LaTeX to be parsed, I guess because these are regarded as moving targets. Instead, lists of supported commands, like the one I mentioned for MathJax, seem to be the way things are done.

With DVI-based converters, the issue of parsing LaTeX goes away, replaced by the relatively trivial issue of parsing marked-up DVI and the trickier issue of identifying the semantically significant macros and constructing markup-issuing replacements that do not improperly interfere. I haven't looked at how this is done for equational layout. It would be a useful exercise to see how a converter from TeX formulae to those of Heckmann & Wilhelm (1997) would work; it's worth noting that the representation of expressions is essentially a superset of the one they use.
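The idea of a markup-issuing replacement macro can be sketched as follows. This is only an illustration under stated assumptions: the command name \markedfrac and the special strings "mymark:begin frac" and "mymark:end frac" are hypothetical, and real converters such as tex4ht define their own special syntax and hook into existing macros rather than adding new ones.

```latex
\documentclass{article}
% Hypothetical marker syntax ("mymark:..."); real converters define their own.
\newcommand{\markedfrac}[2]{%
  \special{mymark:begin frac}% written to the DVI file before the fraction
  \frac{#1}{#2}%               the usual typesetting is left untouched
  \special{mymark:end frac}%   written to the DVI file after the fraction
}
\begin{document}
$\markedfrac{a}{b}$
\end{document}
```

Compiled with latex to produce DVI, the output carries the two marker strings around the typeset fraction, and a post-processor that reads the DVI file can use them to recover the structure without parsing the LaTeX source.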

Syntax highlighting

A completely different kind of parsing is involved in syntax highlighting, where the idea is to help the author see the significance of the parts of the formulae. I don't know of any syntax highlighters that do an interesting job here: AUCTeX only raises/lowers superscripts and subscripts, but I haven't really looked.

Reference

Heckmann & Wilhelm, 1997, A Functional Description of TeX's Formula Layout.

[Tex/LaTex] Parse a string into tokens of numbers and not numbers

With the experimental (but pretty much ready for release) package l3regex (found in the l3experimental bundle on CTAN), this task is a piece of cake.

\documentclass{article}
\usepackage{l3regex,xparse}
\ExplSyntaxOn
\seq_new:N \l_uiy_result_seq
\NewDocumentCommand { \UiySplit } { m }
  {
    \regex_extract_all:nnN { \D+ | \d+(?:\.\d*)? } {#1} \l_uiy_result_seq
    \seq_map_inline:Nn \l_uiy_result_seq { item:~##1\par }
  }
\ExplSyntaxOff
\begin{document}
  \UiySplit{ljksadflh23898129hfafh0324.22234}
\end{document}

The \regex line splits the user input #1 into pieces which either consist of one or more (+) non-digits (\D), or (|) of one or more digits (\d), followed maybe (? acting on the group (...), which we want to be "non-capturing", done using (?:...)) by a dot (\. escaped dot, because the dot has a special meaning) and zero or more digits (\d*). The line below maps through all the matches we found, with ##1 being a single match. Of course, you can do whatever you want to do with the items of the sequence \l_uiy_result_seq.
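As one example of processing the items further, the sketch below labels each piece as a number or as text. It assumes a LaTeX release from October 2020 or later, where \NewDocumentCommand and the regex functions are available without loading any package; on older systems, load l3regex and xparse as in the example above. The names \UiyClassify and \l_uiy_pieces_seq are made up for this illustration.

```latex
\documentclass{article}
\ExplSyntaxOn
\seq_new:N \l_uiy_pieces_seq % hypothetical name, holds the extracted pieces
\NewDocumentCommand { \UiyClassify } { m }
  {
    % Same splitting regex as above: runs of non-digits, or a number
    \regex_extract_all:nnN { \D+ | \d+(?:\.\d*)? } {#1} \l_uiy_pieces_seq
    \seq_map_inline:Nn \l_uiy_pieces_seq
      {
        % A piece that starts with a digit is a number, anything else is text
        \regex_match:nnTF { \A \d } {##1}
          { number:~##1 \par }
          { text:~##1 \par }
      }
  }
\ExplSyntaxOff
\begin{document}
\UiyClassify{ljksadflh23898129hfafh0324.22234}
\end{document}
```

Because the splitting regex only ever produces pieces made entirely of non-digits or pieces that begin with a digit, testing the first character is enough to tell the two kinds apart.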

Edit: The module also provides regular expression replacements. If I remember the syntax correctly, the following should work.

\ExplSyntaxOn
\tl_new:N \l_uiy_result_tl
\NewDocumentCommand { \UiySplit } { m }
  {
    \tl_set:Nn \l_uiy_result_tl {#1}
    \regex_replace_all:nnN
        { (\D+) (\d+(?:\.\d*)?) }
        { \c{uiy_do:nn} \cB{\1\cE} \cB{\2\cE} }
        \l_uiy_result_tl
    \tl_use:N \l_uiy_result_tl
  }
\cs_new_protected:Npn \uiy_do:nn #1#2 { \use:c {#1} {#2} }
\ExplSyntaxOff

This time, I catch both the sequence of non-digits and the number as captured groups, \1 and \2. Each such occurrence is replaced by the macro \uiy_do:nn (the \c escape in this case indicates "build a command"), then a begin-group (\cB) character { (this time, \c indicates the category code), then the non-digits (\1), then an end-group (\cE) character }, then another \cB{, the number, and a closing \cE}.

After that, the token list looks like \uiy_do:nn {ljksadflh} {1}. We then simply use its contents with \tl_use:N. The final step is to actually define \uiy_do:nn. Here, I defined it as simply building a command from #1, and giving it the argument #2. This very simple action could be done at the replacement step using \c{\1} for "build a command from the contents of group \1", and technically it would be slightly better, producing an "undefined control sequence" error if the relevant command is not defined. Another option for that error detection to happen is to replace \use:c {#1} {#2} by \cs_if_exist_use:cF {#1} { \msg_error:nnx { uiy } { undefined-command } } {#2}, with an appropriately defined error message.
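The error-detecting variant could look roughly like the sketch below. The message text is my own wording, the wrapper \UiyDo is a hypothetical helper added only so the function can be called from the document body, and the existence test is written with \cs_if_exist:cTF rather than \cs_if_exist_use:cF so that the argument passed to the message is explicit. It assumes a LaTeX release from October 2020 or later.

```latex
\documentclass{article}
\ExplSyntaxOn
% Message text is illustrative; #1 is the name of the missing command
\msg_new:nnn { uiy } { undefined-command }
  { The~command~'\iow_char:N \\#1'~is~not~defined. }
\cs_new_protected:Npn \uiy_do:nn #1#2
  {
    \cs_if_exist:cTF {#1}
      { \use:c {#1} {#2} }                                   % call \#1{#2}
      { \msg_error:nnn { uiy } { undefined-command } {#1} }  % report the name
  }
% Hypothetical document-level wrapper, only for trying the function out
\NewDocumentCommand { \UiyDo } { m m } { \uiy_do:nn {#1} {#2} }
\ExplSyntaxOff
\begin{document}
\UiyDo{textbf}{12}         % \textbf exists, so this typesets a bold 12
\UiyDo{nosuchcommand}{12}  % raises the "undefined-command" error of module uiy
\end{document}
```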
