If I google for parse tree and TeX, all I get is instructions on how to draw parse trees in TeX.
I want to know if there are programs that can parse TeX, and output a parse tree that's easy to manipulate.
parsing
I've been looking into this too, so I'll share some observations that fall rather short of a proper answer, which would really involve looking at a whole lot of source code and asking the right questions about it.
Parsers generating HTML+Math ML
tex4ht runs latex with modified macros that insert specials into the DVI output, and parses the DVI output instead.
Parsers generating XML
Jason Blevins has a list of tools that convert Latex documents to XML-based formats, and that handle equations reasonably. Romeo Anghelache's Hermes, which is part of a full Latex parser that generates XML with semantic markup, is worth singling out: like tex4ht, it works by running the Tex engine with macros to put specials in the DVI output, which it then parses; it supports a wider set of semantic markup.
Fragments of Latex or DVI
With the exception of the systems referencing Webtex, there doesn't seem to be much interest in clearly codifying subsets of Latex to be parsed, I guess because these are regarded as moving targets. Instead, lists of supported commands, like the one I mentioned for Mathjax, seem to be the way things are done.
With DVI-based converters, the issue of parsing Latex goes away, replaced by the relatively trivial issue of parsing marked-up DVI and the trickier issue of identifying the semantically significant macros and constructing markup-issuing replacements that do not improperly interfere. I haven't looked at how this is done for equational layout. It would be a useful exercise to see how a converter from Tex formulae to the representation of expressions used by Heckmann & Wilhelm (1997) would work.
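To make the idea of a "markup-issuing replacement" concrete, here is a minimal sketch. The macro name \markedfrac and the sem: marker strings are made up for illustration; real converters such as tex4ht use their own schemes, which I have not checked.

```latex
\documentclass{article}
% Hypothetical wrapper: brackets an ordinary \frac with two \special
% markers, so that a tool reading the DVI file can see where the
% fraction starts and ends. The "sem:" strings are invented here.
\newcommand{\markedfrac}[2]{%
  \special{sem:begin frac}%
  \frac{#1}{#2}%
  \special{sem:end frac}%
}
\begin{document}
$\markedfrac{a}{b}$
\end{document}
```

This is meant to be run with latex (DVI output) rather than pdflatex, since the specials only make sense to a DVI reader that knows about them. The hard part mentioned above, making sure such wrappers do not interfere with the macros they replace, is not addressed by this sketch.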
Syntax highlighting
A completely different kind of parsing is involved in syntax highlighting, where the idea is to help the author see the significance of the parts of the formulae. I don't know of any syntax highlighters that do an interesting job here: Auctex only raises/lowers superscripts and subscripts, but I haven't really looked.
Reference
Heckmann & Wilhelm, 1997, A Functional Description of TeX's Formula Layout.
With the experimental (but pretty much ready for release) package l3regex (found in the l3experimental bundle on CTAN), this task is a piece of cake.
\documentclass{article}
\usepackage{l3regex,xparse}
\ExplSyntaxOn
\seq_new:N \l_uiy_result_seq
\NewDocumentCommand { \UiySplit } { m }
{
\regex_extract_all:nnN { \D+ | \d+(?:\.\d*)? } {#1} \l_uiy_result_seq
\seq_map_inline:Nn \l_uiy_result_seq { item:~##1\par }
}
\ExplSyntaxOff
\begin{document}
\UiySplit{ljksadflh23898129hfafh0324.22234}
\end{document}
The \regex_extract_all:nnN line splits the user input #1 into pieces which consist either of one or more (+) non-digits (\D), or (|) of one or more digits (\d), optionally followed (? acting on the group (...), which we want to be "non-capturing", done using (?:...)) by a dot (\. is an escaped dot, because the dot has a special meaning) and zero or more digits (\d*). The line below maps through all the matches we found, with ##1 being a single match. Of course, you can do whatever you want with the items of the sequence \l_uiy_result_seq.
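As one illustration of doing something else with the items, here is a small sketch that labels each piece as a number or as text. The names \UiyClassify and \l_uiy_pieces_seq are made up for this example; the preamble mirrors the one above.

```latex
\documentclass{article}
\usepackage{l3regex,xparse}
\ExplSyntaxOn
\seq_new:N \l_uiy_pieces_seq % hypothetical variable name
\NewDocumentCommand { \UiyClassify } { m }
  {
    % Same splitting regex as above: runs of non-digits, or numbers.
    \regex_extract_all:nnN { \D+ | \d+(?:\.\d*)? } {#1} \l_uiy_pieces_seq
    \seq_map_inline:Nn \l_uiy_pieces_seq
      {
        % A piece that starts with a digit must be one of the numbers.
        \regex_match:nnTF { \A \d } {##1}
          { number:~##1 \par }
          { text:~##1 \par }
      }
  }
\ExplSyntaxOff
\begin{document}
\UiyClassify{ljksadflh23898129hfafh0324.22234}
\end{document}
```

With the sample input, the pieces should be ljksadflh, 23898129, hfafh and 0324.22234, labelled text, number, text, number in that order.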
Edit: The module also provides regular expression replacements. If I remember the syntax correctly, the following should work.
\ExplSyntaxOn
\tl_new:N \l_uiy_result_tl
\NewDocumentCommand { \UiySplit } { m }
{
\tl_set:Nn \l_uiy_result_tl {#1}
\regex_replace_all:nnN
{ (\D+) (\d+(?:\.\d*)?) }
{ \c{uiy_do:nn} \cB{\1\cE} \cB{\2\cE} }
\l_uiy_result_tl
\tl_use:N \l_uiy_result_tl
}
\cs_new_protected:Npn \uiy_do:nn #1#2 { \use:c {#1} {#2} }
\ExplSyntaxOff
This time, I catch both the sequence of non-digits and the number as captured groups, \1 and \2. Each such occurrence is replaced by the macro \uiy_do:nn (the \c escape in this case indicates "build a command"), then a begin-group (\cB) character { (this time, \c indicates the category code), then the non-digits (\1), then an end-group (\cE) character }, then another \cB{, the number, and a closing \cE}.
After that, the token list looks like \uiy_do:nn {ljksadflh} {1}. We then simply use its contents with \tl_use:N. The final step is to actually define \uiy_do:nn. Here, I defined it as simply building a command from #1 and giving it the argument #2. This very simple action could be done at the replacement step using \c{\1} for "build a command from the contents of group \1", and technically it would be slightly better, producing an "undefined control sequence" error if the relevant command is not defined. Another way to get that error detection is to replace \use:c {#1} {#2} by \cs_if_exist_use:cF {#1} { \msg_error:nnx { uiy } { undefined-command } {#1} } {#2}, with an appropriately defined error message.
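For completeness, a sketch of what "an appropriately defined error message" could look like. The message wording is my own choice, and this definition of \uiy_do:nn is meant to replace the simpler one given earlier, not to sit alongside it.

```latex
\ExplSyntaxOn
% Message text is an assumption; #1 is the name that could not be found.
\msg_new:nnn { uiy } { undefined-command }
  { The~command~'\iow_char:N \\#1'~is~not~defined. }
\cs_new_protected:Npn \uiy_do:nn #1#2
  {
    % If \<#1> exists it is used and picks up {#2} as its argument;
    % otherwise the error is raised and {#2} is left as a plain group.
    \cs_if_exist_use:cF {#1}
      { \msg_error:nnx { uiy } { undefined-command } {#1} }
    {#2}
  }
\ExplSyntaxOff
```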
Best Answer
TeX is not built along the classical compiler architecture of a scanner and a parser that builds a parse tree. (When TeX was invented, these concepts were much less clear than they are today.) Since TeX does not use a parse tree internally, you are unlikely to find a tool that represents TeX code accurately in such a way. If you are serious about manipulating TeX code, I'd suggest you tell us what you want to achieve; maybe there is another way to do it. One tool I would consider looking at is Hevea.
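A small illustration of why a static parse tree is hard to get: how TeX splits the input into tokens can change while the document is being processed. The macro name \pipeescape below is made up for the example.

```latex
\documentclass{article}
\begin{document}
% Hypothetical macro: until it is executed, nothing in the source reveals
% that "|" is about to become an escape character like "\".
\newcommand{\pipeescape}{\catcode`\|=0 }

Before: \textbf{bold} |textbf{not a command yet}

\pipeescape
After: |textbf{now a command}
\end{document}
```

In the first paragraph |textbf is just a vertical-bar character followed by letters; after \pipeescape has run, the same characters are read as the command \textbf. A tool that wants to build a tree from the source therefore has to either execute the macros as TeX does or restrict itself to a subset where such changes do not happen.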