[TeX/LaTeX] TeX as the basis for a data-driven document generation system

automation, templates

I am in the early stages of trying to replace an existing document generation system based on Microsoft Office apps (Word and Excel).

The system runs once a year to generate over 300,000 individual documents of three different types (the new system may instead generate individual documents on demand rather than batch processing them all at once). Two of the types are mostly standard text with value place-holders and some conditional logic for including or excluding certain elements (standard paragraphs and/or values). One of the types includes a bar chart based on individual calculations. All of the values for replacement and conditional logic are sourced from an ASCII file, but some of the values are calculated from the data.

The current system is very slow and error-prone (at run time) and requires a complex system of machines, threads and message queues to scale the processing resources to a level that will get the job done within two weeks or so. Basically, there are three Word document templates that include the value place-holders, conditional logic and text. The templates are processed with Office interop libraries to create an instance document. In the case of one of the types, Excel is used to create a bar chart that is injected (OLE embedded) into the instance document. The Word instance documents are then converted to (saved as) PDF.

I only know a little about TeX (brushed up against it during my many years of Emacs use), but it seems that it might serve as a good basis to replace the behemoth described above. The problem is that I need some guidance as to whether or not TeX would be a good route to pursue (performance being a key factor), and some pointers to resources that can accomplish the more obscure needed tasks (I know PDF generation is no issue).

The final system would execute on Windows machine(s), and programmatic processing would be done with .NET or Java most likely.

Best Answer

First of all, I can only agree with the commenters: TeX is certainly one of the best systems you can find for this kind of task.

As your question is not very specific and some pointers have already been given, let me just give some examples for comparable uses and suggestions for proceeding further. Feel free to ask more specific questions ;-)

One example of a (commercial) data driven document generation system implemented entirely in TeX is my DocScape. You can find some references here; I also gave some examples in this answer.

To give you some numbers on performance: a German federal state (Niedersachsen) is using a TeX-based system for publishing budget documents (budget plan, reports and a lot of other stuff). About 16,000 people from the state administration are involved in maintaining the data. Every one of them can generate a preview (of about 2-10 pages) at any time, which leads to up to 300 documents being generated in parallel at any time (on an AIX mainframe).

Once a year, several volumes of 1000+ pages are generated for print, plus a couple of intermediate versions.

See for instance the last budget report.

In general, I'm probably not the right person to comment on performance, because DocScape is sadly rather inefficient from a macro programming point of view, so I can't really report any speed records for my own projects.

TeX itself, on the other hand, certainly is a role model of efficiency, because it hasn't been affected by software bloat for at least the last 30 years. So you shouldn't run into any performance problems at all. In particular, if you're generating a lot of independent documents, you can run as many TeX processes in parallel as you have processors in your machine, speeding things up further.
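To illustrate the point about parallel runs, here is a minimal sketch in Python (the same idea carries over to .NET or Java). It assumes a `pdflatex` executable is on the PATH and that each document already exists as its own `.tex` file. The function names `build_command`, `run_pdflatex` and `compile_all` are made up for this example and are not part of any existing tool.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(tex_file, out_dir):
    # Standard pdflatex options: never stop for user input, abort on first error.
    return [
        "pdflatex",
        "-interaction=nonstopmode",
        "-halt-on-error",
        f"-output-directory={out_dir}",
        tex_file,
    ]


def run_pdflatex(tex_file, out_dir):
    # Returns True if the TeX run exited with status 0.
    result = subprocess.run(
        build_command(tex_file, out_dir),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def compile_all(tex_files, out_dir, runner=run_pdflatex, workers=None):
    # tex_files is expected to be a list of paths.
    # Threads suffice here: each worker only waits on an external process.
    # By default, use one worker per processor (assumption: TeX runs are CPU-bound).
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda f: runner(f, out_dir), tex_files))
    return dict(zip(tex_files, results))
```

Calling `compile_all(["doc0001.tex", "doc0002.tex"], "out")` would return a dictionary mapping each file to whether its run succeeded, so failed documents can be retried or logged. The `runner` parameter is only there so the scheduling logic can be tried without a TeX installation.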

Here are some further hints on how to proceed.

  • First, I would preprocess the input (ASCII) data, transforming it into either XML or some "pseudo" TeX data notation.
  • This doesn't mean you need to generate complete TeX documents, but at least inserting some control sequences to mark up document and data structures, images etc. will make later processing with TeX much easier.
  • I would by all means do all the number crunching during preprocessing, in particular for the bar charts. The charts themselves can then be plotted with TikZ.
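The preprocessing hints above could be sketched roughly as follows, again in Python. Everything specific here is an assumption for illustration: the pipe-delimited record layout (`id|name|value1|value2|...`), the macro names `\DocId`, `\Name` and `\Total`, and the chart dimensions. The values are assumed to be non-negative. The script does the arithmetic and emits only simple control sequences plus a ready-made TikZ picture, so the TeX template has nothing left to calculate.

```python
TEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def tex_escape(text):
    # Escape characters that TeX would otherwise treat as markup.
    return "".join(TEX_SPECIALS.get(ch, ch) for ch in text)


def bar_chart_tikz(values, max_height=4.0, bar_width=0.8):
    # Scale the bars so that the largest value is max_height units tall.
    peak = max(values) or 1.0
    lines = [r"\begin{tikzpicture}"]
    for i, v in enumerate(values):
        height = max_height * v / peak
        lines.append(
            rf"\fill ({i:.1f},0) rectangle ({i + bar_width:.1f},{height:.2f});"
        )
    lines.append(r"\end{tikzpicture}")
    return "\n".join(lines)


def record_to_tex(line):
    # Hypothetical record layout: id|name|value1|value2|...
    doc_id, name, *raw_values = line.rstrip("\n").split("|")
    values = [float(v) for v in raw_values]
    return "\n".join([
        rf"\def\DocId{{{tex_escape(doc_id)}}}",
        rf"\def\Name{{{tex_escape(name)}}}",
        rf"\def\Total{{{sum(values):.2f}}}",  # calculated here, not in TeX
        bar_chart_tikz(values),
    ])


if __name__ == "__main__":
    print(record_to_tex("42|Smith & Co|10|30"))
```

The output for one record can be written to a small data file that a fixed template pulls in with `\input`, so the template itself stays identical for every document.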