[TeX/LaTeX] Are there many cases where Pandoc fails on Word/.docx to TeX conversion of formulas/math?

conversion, journal-publishing, math-mode, msword, pandoc

How reliable is a LaTeX (XeTeX) publishing workflow which depends on Pandoc to handle submissions in .docx format from recent versions of Word? I can anticipate submissions with a fair amount of math, formulas, and symbols in them produced by Word's Equation Editor. Does Pandoc convert them reliably to LaTeX/XeTeX in most cases? Are there specific kinds of cases or expressions on which Pandoc generally fails?

(The math example with one formula provided in Pandoc's documentation is converted well from .docx to LaTeX by Pandoc. So it works on a minimal working example. But I want to know about the full range of output from Word's Equation Editor, and I don't have a maximal possibly-not-working example!)

Best Answer

In my experience, docx to LaTeX conversion of math works well, provided the document uses the new (now standard) equation editor rather than the old Equation Editor 3.0 (or whatever its exact name is), whose equations can still be embedded in the docx format.
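
To see which kind of equations a submission contains before converting it, something like the following sketch may help. It rests on assumptions that are mine, not something pandoc documents: that new-style equations are stored as `m:oMath` elements in `word/document.xml`, that legacy equations are embedded OLE objects whose `ProgID` starts with `Equation.`, and that Word writes the usual `m:` namespace prefix. The function name `count_equation_kinds` is made up for illustration.

```python
import re
import zipfile


def count_equation_kinds(docx_path):
    """Count new-style equations and legacy equation objects in a .docx file.

    Assumptions (not verified against every Word version): new-style
    equations appear as <m:oMath> elements, and legacy equations as OLE
    objects with a ProgID such as "Equation.3".
    """
    with zipfile.ZipFile(docx_path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # The lookahead keeps <m:oMathPara> wrappers from being counted as equations.
    native = len(re.findall(r"<m:oMath(?=[\s>/])", xml))
    # Any ProgID beginning with "Equation." is treated as a legacy object.
    legacy = len(re.findall(r'ProgID="Equation\.[^"]*"', xml))
    return {"native": native, "legacy": legacy}
```

A non-zero `legacy` count would suggest asking the author to redo those equations in the new editor before conversion. Only the main document part is inspected here; footnotes and headers live in separate parts of the archive.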

One problem comes from Unicode symbols, such as Greek letters, which sometimes appear in the converted document in their original form rather than as the LaTeX equivalent (for example, a literal α instead of \alpha). This is fairly easy to fix with a search-and-replace script that handles these symbols.
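
A replace script of that kind could look roughly like this. It is only a sketch: the table is deliberately partial, the names `GREEK_TO_LATEX` and `replace_greek` are made up, and wrapping each macro in `\ensuremath{...}` is my own choice so that the result compiles both inside and outside math mode.

```python
import re

# Partial, illustrative table; extend it with the symbols that actually occur.
GREEK_TO_LATEX = {
    "\u03b1": r"\alpha",       # alpha
    "\u03b2": r"\beta",        # beta
    "\u03b3": r"\gamma",       # gamma
    "\u03b4": r"\delta",       # delta
    "\u03b5": r"\varepsilon",  # epsilon (U+03B5 looks like \varepsilon)
    "\u03b8": r"\theta",       # theta
    "\u03bb": r"\lambda",      # lambda
    "\u03bc": r"\mu",          # mu
    "\u03c0": r"\pi",          # pi
    "\u03c3": r"\sigma",       # sigma
    "\u03c6": r"\varphi",      # phi (U+03C6 looks like \varphi)
    "\u03c9": r"\omega",       # omega
    "\u0393": r"\Gamma",       # capital gamma
    "\u0394": r"\Delta",       # capital delta
    "\u03a3": r"\Sigma",       # capital sigma
    "\u03a9": r"\Omega",       # capital omega
}

# One character class that matches any symbol in the table.
_GREEK = re.compile("[" + "".join(GREEK_TO_LATEX) + "]")


def replace_greek(tex):
    """Replace literal Greek letters with LaTeX macros wrapped in \\ensuremath."""
    return _GREEK.sub(
        lambda m: r"\ensuremath{" + GREEK_TO_LATEX[m.group(0)] + "}", tex
    )
```

The braces of `\ensuremath{...}` also keep a macro from running into a following letter, which a bare `\alpha` followed by `x` would do. With XeTeX and a font that covers Greek, leaving the literal characters in running text may be acceptable as well, so this replacement is a matter of house style rather than a necessity.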

Other problems arise from complex formatting, e.g. headings, footnotes, etc. I guess that all of these conversions are implemented well; however, in a real Word document, authors often use the formatting inconsistently or even incorrectly. For example, a low-level heading style may be used in Word "equivalently" to boldface ("equivalently" in the sense that the output looks the same). When converted to LaTeX, this text becomes e.g. a \subsubsection, which obviously was not the intention.
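
Misused headings cannot be detected automatically, but listing every sectioning command in the converted file makes them quicker to review by hand. Here is a minimal sketch, assuming the converted file uses the standard LaTeX sectioning commands; the function name `list_headings` is hypothetical.

```python
import re

# Standard LaTeX sectioning commands, optionally starred.
SECTIONING = re.compile(
    r"\\(part|chapter|section|subsection|subsubsection|paragraph|subparagraph)\*?\{"
)


def list_headings(tex):
    """Return (line_number, command, line_text) for each sectioning command."""
    found = []
    for number, line in enumerate(tex.splitlines(), start=1):
        for match in SECTIONING.finditer(line):
            found.append((number, match.group(1), line.strip()))
    return found
```

An editor can then skim the list for entries that read like emphasized phrases rather than real section titles and change them back to bold text.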

Other than that, older versions of Word insert some internal "labels" that are still supported by newer versions; however, when pandoc encounters them, they are dropped completely. In my experience, this was the case with some unit conversion tags, which allow a document to be converted automatically from metric to imperial units, etc. While this may not look like a very likely scenario, note that Word 2007 inserts these tags automatically, without the writer's knowledge.
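
To check whether a submission contains such tags, one can look inside the .docx archive directly. The sketch below assumes that the "labels" I ran into are what WordprocessingML calls smart tags (the `w:smartTag` element); that identification is a guess on my part. The function name `smart_tag_texts` is made up.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace, used for the w: prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def smart_tag_texts(docx_path):
    """Return the text wrapped by each w:smartTag element in the main document."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for tag in root.iter(f"{{{W}}}smartTag"):
        # Join all text runs inside this tag (nested tags get reported again).
        texts.append("".join(t.text or "" for t in tag.iter(f"{{{W}}}t")))
    return texts
```

Searching the converted LaTeX for each returned string shows whether anything went missing during conversion.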


TL;DR: Pandoc is a great tool (in my opinion the best among free software); however, a fair amount of manual work may be required after the conversion, and proofreading is necessary.

Note: This is my personal experience, I'm not a pandoc expert. Maybe some of these problems can be solved by a proper configuration.
