[TeX/LaTeX] How to convert office documents (odt, doc, rtf) to clean LaTeX

conversion

[I know, this question has popped up several times, but let me clarify my situation.]
I have to edit certain books, and the authors send me their documents as .doc, .odt or .rtf files. I want to convert those files to LaTeX, but the only structure and formatting I want to keep are:

  • italic
  • footnotes
  • ordered and unordered lists

So what I usually do is convert those documents to .rtf using LibreOffice (because this gets rid of much of the formatting I don't want to keep), and then convert the result to LaTeX with the famous LibreOffice extension.
This works well, but the result is not always clean: the LaTeX document often still contains colour tags or other unwanted formatting.

Does anybody know a better way to achieve that goal?
How to convert those documents, keeping only footnotes, italic, and lists?

Best Answer

I know what a typical document looks like in the field of philosophy, and I understand that the formatting is usually not very complex. Ideally, though, works of philosophy are structured and semantically rich documents. It might be worth considering, for example, whether the italics are meant to convey emphasis or to mark a book title, and marking them up accordingly. Keeping the outline and hierarchy of the headings intact might also be beneficial. Which brings me to Writer2LaTeX.

I personally do not see any advantage in using RTF in this day and age and my work revolves around OpenDocument. When I have to integrate different documents and produce a clean typeset version with LaTeX (which happens more and more often), I usually rely on a combination of Pandoc and Writer2LaTeX. Here are some observations on the latter:

  • To preserve all the different kinds of styles and structural elements, one has no choice but to choose a “print” profile; for what I do, the “clean” and “ultraclean” profiles are inadequate. One reason is that I often have multi-column text sections in my ODT documents (I mean actual sections, div’s or environments if you like, not section-level headings), and these sections are not preserved in Writer2LaTeX’s output when a “clean” profile is selected. Then again, even with the “print” profile, what gets produced is a multicols environment and not a custom environment with its own name, which is not as semantic as I’d like.

  • The results of the conversion are usually better when all manual formatting is removed and replaced with styles, including character styles, which is why I suggest replacing italics with styles such as “Emphasis” and “Work-title”. That way, every style has its own definition in the generated LaTeX preamble.

  • Even when trying to be as disciplined as possible, there is always cruft to remove. Writer2LaTeX respects the way that LibreOffice writes its ODF files, which means that many automatic styles are produced, even when a named style was applied. It is as though LibreOffice “instantiates” the applied styles and replaces them with automatically numbered ones (P1, P2 etc. for paragraphs, T1, T2 etc. for text spans). I find this very irritating, especially because this behavior is only present in recent versions of LibreOffice and is not present in OpenOffice 3.4.1.

  • In most cases, you will want to modify the command and environment definitions in the generated preamble, because they are not at all optimal (obviously, the use of an automated tool cannot replace actual knowledge of LaTeX).
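
Some of the leftover cruft can be removed mechanically before the manual editing starts. The following is only a minimal Python sketch for the colour tags mentioned in the question. It assumes the generated source uses the standard \textcolor and \color commands (possibly with an optional colour model such as [rgb]); the function names are made up for this illustration and are not part of Writer2LaTeX or Pandoc.

```python
import re

# Matches \textcolor or \color with an optional colour model such as [rgb].
# The lookahead keeps longer names such as \colorbox untouched.
_COLOUR_CMD = re.compile(r"\\(textcolor|color)(?![A-Za-z])(\[[^\]]*\])?")


def _read_group(text, start):
    """Return (content, index_after) for the brace group opening at text[start]."""
    if start >= len(text) or text[start] != "{":
        raise ValueError("expected '{'")
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unbalanced braces")


def strip_colour(text):
    """Drop \\color{...} and unwrap \\textcolor{...}{body}, keeping the body."""
    out = []
    pos = 0
    while True:
        m = _COLOUR_CMD.search(text, pos)
        if m is None:
            out.append(text[pos:])
            return "".join(out)
        out.append(text[pos:m.start()])
        try:
            _, after_spec = _read_group(text, m.end())  # colour specification
            if m.group(1) == "textcolor":
                body, after_body = _read_group(text, after_spec)
                out.append(strip_colour(body))  # handle nested colour commands
                pos = after_body
            else:
                pos = after_spec
        except ValueError:
            # Not in the expected form: leave the command as it is.
            out.append(m.group(0))
            pos = m.end()


sample = r"\textcolor[rgb]{0.0,0.0,0.0}{\textit{Kritik}}\footnote{See \textcolor{red}{p. 12}.}"
print(strip_colour(sample))
```

Italics, footnotes and lists pass through unchanged because only the two colour commands are matched. Other unwanted commands would each need their own pattern, and the output should still be checked by hand.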

Discipline and patience are required, but it is definitely possible to produce beautiful documents with OpenDocument, LibreOffice, Writer2LaTeX and some careful manual editing of the resulting LaTeX source.
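
For the Pandoc half of the workflow mentioned above, a conversion run can be scripted along these lines. This is a sketch under assumptions, not a description of my exact setup: it assumes that pandoc is installed and on the PATH, that the version is recent enough to read RTF, and the helper names pandoc_command and convert are hypothetical. Legacy .doc files are not read by pandoc and would have to be saved as .odt first.

```python
import subprocess
from pathlib import Path

# File extensions mapped to pandoc reader names; legacy .doc is not supported.
PANDOC_READERS = {".odt": "odt", ".docx": "docx", ".rtf": "rtf"}


def pandoc_command(source, target=None):
    """Build the pandoc argument list for converting one document to LaTeX."""
    src = Path(source)
    reader = PANDOC_READERS.get(src.suffix.lower())
    if reader is None:
        raise ValueError(f"pandoc cannot read {src.suffix!r} files directly")
    dst = Path(target) if target else src.with_suffix(".tex")
    return [
        "pandoc",
        f"--from={reader}",
        "--to=latex",
        "--wrap=none",  # keep each paragraph on one line, which eases later editing
        f"--output={dst}",
        str(src),
    ]


def convert(source, target=None):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(source, target), check=True)


print(pandoc_command("chapter1.odt"))
```

Without the standalone option, pandoc writes only the document body, so italics, footnotes and lists arrive as plain \emph, \footnote, itemize and enumerate markup with no generated preamble to clean up.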

Update: between 2012 and 2014, development of Writer2LaTeX was dormant, which was disconcerting because the latest stable version did not work properly with LibreOffice 4.0 and later. My solution was to use the “standalone” version of Writer2LaTeX, available from the project’s website. This standalone version is operated from the command line, but I actually find it simpler to configure that way.

Now, since the end of 2014, development of Writer2LaTeX has resumed and it is better than ever. Compatibility with XeTeX has improved, making the extension much easier to use in a typical multilingual environment.