[Tex/LaTex] TeX as the basis for data driven document generation system

automation, templates

I am in the early stages of trying to replace an existing document generation system based on Microsoft Office apps (Word and Excel).

The system runs once a year to generate over 300,000 individual documents of three different types (the new system may instead generate individual documents on demand rather than batch processing everything at once). Two of the types are mostly standard text with value placeholders and some conditional logic for including or excluding certain elements (standard paragraphs and/or values). One of the types includes a bar chart based on individual calculations. All of the values for replacement and conditional logic are sourced from an ASCII file, but some of the values are calculated from the data.

The current system is very slow and error prone (at run time) and requires a complex system of machines, threads and message queues to scale the processing resources to a level that will get the job done within two weeks or so. Basically, there are three Word Document templates that include the value place-holders and conditional logic and text. The templates are processed with Office interop libraries to create an instance document. In the case of one of the types, Excel is used to create a bar chart that is injected (OLE embedded) into the instance document. The Word instance documents are then converted to (saved as) PDF.

I only know a little about TeX (I brushed up against it during my many years of Emacs use), but it seems that it might serve as a good basis to replace the behemoth described above. I need some guidance as to whether or not TeX would be a good route to pursue (performance being a key factor), and some pointers to resources that can accomplish the more obscure tasks needed (I know PDF generation is no issue).

The final system would execute on Windows machine(s), and programmatic processing would be done with .NET or Java most likely.

Best Answer

First of all, I can only join the commenters in saying that TeX is certainly one of the best systems you can find for this kind of task.

As your question is not very specific and some pointers have already been given, let me just give some examples for comparable uses and suggestions for proceeding further. Feel free to ask more specific questions ;-)

One example of a (commercial) data driven document generation system implemented entirely in TeX is my DocScape. You can find some references here; I also gave some examples in this answer.

To give you some numbers on performance: a German federal state (Niedersachsen) is using a TeX-based system for publishing budget documents (budget plan, reports and a lot of other stuff). About 16,000 people from the state administration are involved in maintaining the data. Every one of them can generate a preview (of about 2-10 pages) at any time, which leads to up to 300 documents being generated in parallel at any time (on an AIX mainframe).

Once a year, several volumes of 1000+ pages are generated for print, plus a couple of intermediate versions.

See for instance the last budget report.

In general, I'm probably not the right person to comment on performance, because DocScape is sadly rather inefficient from a macro programming point of view, so I can't really report any speed records for my own projects.

TeX itself, on the other hand, certainly is a role model of efficiency, because it hasn't been affected by software bloat for at least the last 30 years. So you shouldn't run into any performance problems at all. In particular, if you're generating a lot of independent documents, you can run as many TeX processes in parallel as you have processors in your machine, speeding things up further.

Here are some further hints on how to proceed.

First, I would preprocess the input (ASCII) data, transforming it either into XML or into some "pseudo" TeX data notation.
This doesn't mean you need to generate complete TeX documents, but at least inserting some control sequences to mark up document and data structures, images etc. will make later processing with TeX much easier.
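
As a rough illustration of such a "pseudo" TeX data notation, the preprocessor could write one small file per document that only defines macros, and a fixed template could read it. This is a minimal sketch, not how DocScape or any existing system does it: the file name `record-000123.tex`, the macros `\RecName`, `\RecYear`, `\RecBalance`, the flag `\ifRecHasArrears` and the letter text are all made up for the example.

```latex
% Sketch only: file and macro names below are hypothetical.
\begin{filecontents*}{record-000123.tex}
% This file would be written by the preprocessor (.NET, Java, ...),
% one per output document. It carries data only, no layout.
\newcommand{\RecName}{Jane Doe}
\newcommand{\RecYear}{2012}
\newcommand{\RecBalance}{1\,234.56}
\newif\ifRecHasArrears
\RecHasArrearstrue
\end{filecontents*}

\documentclass{article}
\input{record-000123} % load the data for this one document

\begin{document}
Dear \RecName,

your balance for \RecYear{} is \RecBalance.

% Conditional paragraph: only typeset when the flag is set.
\ifRecHasArrears
Our records show an outstanding amount. Please contact us.
\fi
\end{document}
```

In a batch run the template part would stay fixed and only the data file would change from document to document, so the value placeholders and the conditional logic live in one place.
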
I would by all means do all the number crunching during preprocessing, in particular for the bar charts. The charts themselves can then be plotted with TikZ.
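
A bar chart drawn from precomputed values might look like the following sketch. It assumes the preprocessor has already calculated the numbers and written them as a comma-separated list of name/value pairs; the macro `\ChartData`, the loop variables and the sample values are hypothetical.

```latex
% Sketch only: the values are made up and would come from the preprocessor.
\documentclass{article}
\usepackage{tikz}

% name/value pairs, already calculated during preprocessing
\newcommand{\ChartData}{2009/3.2,2010/4.1,2011/2.7,2012/5.0}

\begin{document}
\begin{tikzpicture}[x=1.5cm, y=1cm]
  % axes
  \draw (0,0) -- (4.5,0);
  \draw (0,0) -- (0,5.5);
  % y-axis ticks with their labels
  \foreach \ytick in {0,1,...,5}
    \draw (0,\ytick) -- (-0.1,\ytick) node[left] {\ytick};
  % one bar per name/value pair; \i counts the bars starting at 1
  \foreach \barname/\barvalue [count=\i] in \ChartData {
    \fill[gray!60] (\i-0.3,0) rectangle (\i+0.3,\barvalue);
    \node[below] at (\i,0) {\barname};
    \node[above] at (\i,\barvalue) {\barvalue};
  }
\end{tikzpicture}
\end{document}
```

Because TeX only draws rectangles at given heights here, all arithmetic stays in the preprocessing step, in line with the advice above.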

Related Solutions

[Tex/LaTex] Is there a common pgfplots styles/templates repository

A template library for different plot styles would not work, because there are too many different kinds of data and ways to present the data. Furthermore, the style depends on personal taste. For example, I prefer to have the ticks outside of the plot, while others find it more elegant to have them inside the plot. The next question is whether the opposite axis should have ticks or not.

Coming back to your question, you do not have to repeat yourself by "copy and paste". You can define your own styles and reuse them. Hence, if you or your supervisor change your mind, you only have to change the style in one place. Here is one example:

\documentclass{standalone}

\usepackage{pgfplots}
\pgfplotsset{compat=1.5}
\usepackage{siunitx}
\SendSettingsToPgf

% define a general plot style
\pgfplotsset{general plot/.style={ 
        xtick pos=left,
        ytick pos=left,
        enlarge x limits=false,
        minor x tick num=1,
        every x tick/.style={color=black, thin},
        every y tick/.style={color=black, thin},
        tick align=outside,
        xlabel near ticks,
        ylabel near ticks,
    } 
}

% define a plot style for absorbance
\pgfplotsset{ir absorbance/.style={
        general plot, % reuse the general plot style
        x dir= reverse,
        ytick = \empty,
        % instead of hard coding the unit you could also use
        % the pgfplots unit library
        xlabel=Wavenumber (\si{\per\centi\metre}),
        ylabel=Absorbance (a.\,u.),
    }
}

\pgfplotsset{ir absorbance data/.style={mark=none}}

\begin{document}
    \begin{tikzpicture}
        \begin{axis}[ir absorbance,
            domain=2000:2200, samples=100 % only needed for the function plotting
        ]
            \addplot[ir absorbance data] { % here you would have: table[...] {mydata.txt}
                exp(-((x-2080)^2/40))
            };
        \end{axis}
    \end{tikzpicture}
\end{document}
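
To share such styles between several documents, one option is to keep the `\pgfplotsset` definitions in a file of their own and `\input` it in each preamble. The following is only a sketch under that assumption: the file name `plotstyles.tex` and the second style `ir transmittance` are hypothetical and not part of pgfplots.

```latex
% Sketch only: file name and the 'ir transmittance' style are illustrative.
\begin{filecontents*}{plotstyles.tex}
\pgfplotsset{
  general plot/.style={
    xtick pos=left,
    ytick pos=left,
    tick align=outside,
  },
  % a second style built on the same base style
  ir transmittance/.style={
    general plot,
    x dir=reverse,
    xlabel={Wavenumber (cm$^{-1}$)},
    ylabel={Transmittance (\%)},
  },
}
\end{filecontents*}

\documentclass{standalone}
\usepackage{pgfplots}
\pgfplotsset{compat=1.5}
\input{plotstyles} % shared style definitions

\begin{document}
\begin{tikzpicture}
  \begin{axis}[ir transmittance, domain=2000:2200, samples=100]
    \addplot[mark=none] {100 - 80*exp(-((x-2080)^2/40))};
  \end{axis}
\end{tikzpicture}
\end{document}
```

A change to `general plot` in that one file then affects every plot whose style is built on it.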

[Image: example plot]

[Tex/LaTex] Trying to create simple template for novice users

Here's the framework for an answer, based on @Ignasi's suggestion. Documentation for the combine package (http://tug.ctan.org/tex-archive/macros/latex/contrib/combine/combine.pdf) will help you play with formatting, make the toc work, even generate an index and a table of figures.

The student gets two files, preamble.tex (read only if possible) and template.tex. She fills in the template, compiles it and submits it with a new name along with her image files (you have to establish the naming conventions).

The preamble:

% preamble.tex
% provided to students, read only
\documentclass[12pt,notitlepage]{article}
\usepackage{graphicx}

\newcommand{\theschool}{to be renewed}
\newcommand{\school}[1]{%
   \renewcommand{\theschool}{#1}
}
\newcommand{\theteacher}{to be renewed}
\newcommand{\teacher}[1]{%
   \renewcommand{\theteacher}{#1}
}

% hack the \date field of \maketitle
\date{School: \theschool{} -- Teacher: \theteacher{}} 

\newcommand{\myfigure}[2]{%
\centering
\includegraphics[height=3cm]{#1}\\
Caption: #2
}
\begin{document}
\newcommand{\alldone}{\end{document}}

template.tex, saved as plato.tex:

% template for students to fill in
% 
\input{preamble}
\author{Plato}
\title{The Republic}
\school{Athens}
\teacher{Socrates}
\maketitle
\begin{abstract}
   \emph{The Republic} in one short paragraph \ldots
\end{abstract}
% first argument is image (.jpg, .png, .pdf)
% second argument is figure caption
\myfigure{therepublic}{Image from wikipedia}
\alldone

Compiles to:

[Image: compiled output of plato.tex]

The wrapper is putittogether.tex, in the directory with student submissions and an empty preamble.tex. I compiled and tested it with a second saved template - code not included here.

% putittogether.tex
\documentclass[12pt]{combine}

% macros from the preamble seen by the students
\usepackage{graphicx}
\newcommand{\theschool}{to be renewed}
\newcommand{\school}[1]{%
   \renewcommand{\theschool}{#1}
}
\newcommand{\theteacher}{to be renewed}
\newcommand{\teacher}[1]{%
   \renewcommand{\theteacher}{#1}
}
\newcommand{\hackdate}{%
  \date{School: \theschool{} -- Teacher: \theteacher{}} 
}
\newcommand{\myfigure}[2]{%
\centering
\includegraphics[height=3cm]{#1}\\
Caption: #2
}
% The \date must be renewed between \imports
\newcommand{\goget}[1]{%
  \hackdate{}\import{#1}
}
\newcommand{\alldone}{} % do nothing

% Combine package configuration
\title{All Together Now}
\author{many authors}
\date{\today}

\begin{document}
\pagestyle{combine} % use the combine page style
\maketitle % main title
\tableofcontents % main ToC
\clearpage

% The files to glue together - all in this directory,
% along with all graphics files required.
%
% Generate this list with a script, then \include it here.

\goget{plato}
\goget{vonneumann}

\end{document}
