[TeX/LaTeX] Converting LaTeX to plain text, e.g. for generating statistics


I would like to convert a large LaTeX project (i.e. spanning multiple files) into plain text. The purpose is generation of statistics, so representing mathematics is not an issue. In fact, all mathematics is ideally ignored.

I have found http://code.google.com/p/textricks/ but could not get it to run. It seems unfinished, but is exactly what I am looking for otherwise.

Best Answer

I would compile the document to a PDF and then use pdftotext to convert it to a text file. To get only the raw text, disable all hyphenation and remove the page headers and footers (\pagestyle{empty}). This way the statistics are based on the LaTeX output rather than the input, which might differ.
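As a rough sketch of that workflow, assuming pdflatex and pdftotext (from Poppler or Xpdf) are on the PATH and the main file sits in the current directory, something like the following Python could drive both steps. The function names build_commands and tex_to_text are made up for illustration, and the preamble changes mentioned above are assumed to be in place already. A document with cross-references or a bibliography will need more than one pdflatex run, which this sketch does not handle.

```python
import subprocess
from pathlib import Path


def build_commands(tex_file: str) -> list[list[str]]:
    """Return the two command lines: compile to PDF, then extract the text.

    Hypothetical helper; assumes tex_file lives in the current directory,
    so pdflatex writes <name>.pdf next to it.
    """
    tex = Path(tex_file)
    pdf = tex.with_suffix(".pdf").name
    txt = tex.with_suffix(".txt").name
    return [
        # nonstopmode keeps pdflatex from waiting for input on errors
        ["pdflatex", "-interaction=nonstopmode", str(tex)],
        # -nopgbrk omits the form feed characters between pages
        ["pdftotext", "-nopgbrk", pdf, txt],
    ]


def tex_to_text(tex_file: str) -> str:
    """Run both commands and return the extracted plain text."""
    for cmd in build_commands(tex_file):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    txt = Path(tex_file).with_suffix(".txt").name
    return Path(txt).read_text(encoding="utf-8")
```

Calling tex_to_text("main.tex") would then give you one string to count words or characters in.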

Of course, if you want statistics about the LaTeX files themselves and not the document generated from them, then you need to convert the source instead. Stripping all macros is very difficult because (La)TeX is quite dynamic: macros can be redefined at any time, and a macro can even redefine itself. Because of this, a fully correct stripper could only be implemented in (La)TeX itself, and even then it would be very difficult. Some tools simply remove all macros and their brace arguments, which might be enough for your task.
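To give an idea of what such a simple tool does, here is a minimal Python sketch of the "remove all macros and brace arguments" approach. It is not any particular existing tool, strip_latex is a name invented here, and it only handles flat (non-nested) arguments and the most common math delimiters. It knows nothing about macro definitions, environments such as equation or verbatim, or files pulled in with \input, so treat it as an approximation only.

```python
import re


def strip_latex(source: str) -> str:
    """Naive LaTeX-to-text: drop comments, math, macros and their arguments."""
    # Forced line breaks (\\) become plain spaces.
    text = source.replace("\\\\", " ")
    # Comments: an unescaped % up to the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", text)
    # Math: \[...\], $$...$$ and $...$ are dropped entirely.
    text = re.sub(r"\\\[.*?\\\]", " ", text, flags=re.DOTALL)
    text = re.sub(r"\$\$.*?\$\$", " ", text, flags=re.DOTALL)
    text = re.sub(r"(?<!\\)\$.*?(?<!\\)\$", " ", text, flags=re.DOTALL)
    # Macros: \name, an optional [..] argument, then any flat {..} arguments.
    text = re.sub(r"\\[A-Za-z]+\*?(?:\[[^\]]*\])?(?:\{[^{}]*\})*", " ", text)
    # Escaped single characters such as \% or \& become the bare character.
    text = re.sub(r"\\(.)", r"\1", text)
    # Leftover grouping braces.
    text = re.sub(r"[{}]", "", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()


sample = "\\section{Intro}\nSome plain text with $x^2$ math. % a comment\nA 50\\% share."
print(strip_latex(sample))  # Some plain text with math. A 50% share.
```

Note that this also throws away the contents of things like \emph{...} and \section{...}, so words inside macro arguments are not counted. If that matters for your statistics, you could keep the brace contents and delete only the control sequence itself, at the cost of also keeping arguments such as label names and file paths.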