[TeX/LaTeX] From old paper documents to modern LaTeX versions

I'm more or less in the process of converting old paper documents (say, from the 1970s) into modern LaTeX versions. What I have been doing is one of the following:

  1. write the whole LaTeX source from scratch (very time-consuming and inefficient)
  2. scan the document, run it through an OCR tool and turn the resulting .txt file into a .tex file (a rough sketch of this step follows the list)
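
For option 2, even a small wrapper script saves some typing. Below is a minimal sketch in Python, assuming Tesseract is installed and on the PATH; the file names and the article document class are placeholders, and the recognized text of course still needs manual proofreading:

    #!/usr/bin/env python3
    """Sketch of option 2: OCR one scanned page with Tesseract and wrap
    the recognized text in a bare LaTeX skeleton (file names are
    placeholders)."""
    import subprocess

    # tesseract <image> <output base> writes <output base>.txt
    subprocess.run(['tesseract', 'page01.tif', 'page01'], check=True)

    with open('page01.txt', encoding='utf-8') as f:
        text = f.read()

    with open('page01.tex', 'w', encoding='utf-8') as f:
        f.write('\\documentclass{article}\n\\begin{document}\n\n')
        f.write(text)
        f.write('\n\\end{document}\n')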

I'm also thinking of using voice recognition software to go a bit faster. Are any of you in the same situation, and what would be your advice for speeding up the whole process? The final goal is to share the PDF documents and the related LaTeX sources on an open archive and give these interesting, "about to die" documents a second life.

Edit 1: work also has to be done on the figures and diagrams. As far as I know, it is almost impossible to automate this task, so I'm currently redrawing everything with either Inkscape, TikZ or PSTricks.

Edit 2: the Tesseract-OCR project is willing to help, but it is not a high priority. In any case, it looks like Tesseract can be trained.

Best Answer

I have the following workflow:

  1. Scan the pages into a series of TIFF images.
  2. Process them with Scan Tailor to fix the orientation, split the pages and get black-and-white images.
  3. Join the resulting images into a multi-page TIFF with the tiffcp command from libtiff (see the sketch after this list).
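
Step 3 is easy to script. Here is a minimal Python sketch, assuming Scan Tailor has written one TIFF per page into an out/ directory (the directory and file names are placeholders):

    #!/usr/bin/env python3
    """Join the per-page TIFFs produced by Scan Tailor into one
    multi-page TIFF with tiffcp (paths are placeholders)."""
    import glob
    import subprocess

    pages = sorted(glob.glob('out/*.tif'))   # one file per page, in page order
    # tiffcp takes the input files first and the output file last
    subprocess.run(['tiffcp', *pages, 'book.tif'], check=True)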

Then I run the OCR with FineReader, because in my case the result has to be in RTF format.

However, open-source OCR engines like Cuneiform or Tesseract have been giving good results lately, and they can export text in the hOCR format. hOCR is essentially HTML with information about paragraphs, page and line breaks and other elements of the page. It should be possible to write a script to convert this format to LaTeX.
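
To give an idea of what such a script could look like, here is a rough Python sketch. It assumes Tesseract-style hOCR, where paragraphs are <p> elements with class ocr_par; it only extracts the paragraph text and escapes LaTeX special characters, so headings, footnotes and hyphenation are left for manual clean-up:

    #!/usr/bin/env python3
    """Rough hOCR-to-LaTeX sketch.  Assumes Tesseract-style hOCR where
    paragraphs are <p> elements with class "ocr_par"; everything else
    (headings, footnotes, hyphenation) is left for manual clean-up."""
    from html.parser import HTMLParser
    import sys

    # LaTeX special characters and their escaped forms
    LATEX_SPECIALS = {
        '\\': r'\textbackslash{}', '&': r'\&', '%': r'\%', '$': r'\$',
        '#': r'\#', '_': r'\_', '{': r'\{', '}': r'\}',
        '~': r'\textasciitilde{}', '^': r'\textasciicircum{}',
    }

    def escape(text):
        return ''.join(LATEX_SPECIALS.get(ch, ch) for ch in text)

    class HocrParagraphs(HTMLParser):
        """Collect the words of every ocr_par element as one paragraph."""
        def __init__(self):
            super().__init__()
            self.paragraphs = []
            self.in_par = False

        def handle_starttag(self, tag, attrs):
            if tag == 'p' and 'ocr_par' in (dict(attrs).get('class') or ''):
                self.in_par = True
                self.paragraphs.append([])

        def handle_endtag(self, tag):
            if tag == 'p':
                self.in_par = False

        def handle_data(self, data):
            words = data.split()
            if self.in_par and words:
                self.paragraphs[-1].extend(words)

    if __name__ == '__main__':
        parser = HocrParagraphs()
        with open(sys.argv[1], encoding='utf-8') as f:
            parser.feed(f.read())
        body = '\n\n'.join(escape(' '.join(p)) for p in parser.paragraphs)
        print('\\documentclass{article}\n\\begin{document}\n')
        print(body)
        print('\n\\end{document}')

Save it as, say, hocr2tex.py and call it as python hocr2tex.py page01.hocr > page01.tex, after producing the hOCR file with something like tesseract page01.tif page01 hocr (older Tesseract versions name the output page01.html rather than page01.hocr).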

Illustrations are another problem: you can vectorize them with potrace or autotrace (potrace is also available from within Inkscape). The results are good for illustrations, but I don't know whether they are usable for diagrams or graphs.
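
If you end up with many figures, the Inkscape route boils down to one potrace call per figure, which can be batched. A small Python sketch, assuming each figure has already been cropped and saved as a bitmap that potrace accepts (PBM here); the file names are placeholders:

    #!/usr/bin/env python3
    """Vectorize one cropped figure with potrace; -s selects the SVG
    backend so the result can be cleaned up in Inkscape afterwards
    (file names are placeholders)."""
    import subprocess

    subprocess.run(['potrace', '-s', 'figure3.pbm', '-o', 'figure3.svg'],
                   check=True)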