[Tex/LaTex] Training Tesseract with generated LaTeX files


My problem

Tesseract is an OCR engine that can be trained.

I need to train it to recognize a specific new font (a TTF font).

I found this discussion, which indicates that it is possible to generate training files with LaTeX. However, the script files are no longer available.

So I started writing my own scripts, but I am now stuck on generating the right files with LaTeX.

How a Tesseract training set is made

To make a Tesseract training set, I need to generate two files: a .box file and an image file (e.g. .tif or .png).

Both files are generated from a training text that I have extracted from an existing Tesseract training set. This training text contains at least 4700 characters.

Here is an example of training text:

    « : : :R: : :» Re: û® 03 vue ë ’ Logiciels septembre ¢§ petite numérique ÈQ été Yû Windows “I” ^Z Commentaires forum privé
    depuis Dernière Annuaire une vendeur ° # “au-delà ” est français FAQ «% moment car œ? | ` fois JuG 14 plus HP Culture

For now, using this text, I generate a PDF file (using OpenOffice). This PDF is then converted to an image (using ImageMagick).

I then use Tesseract to generate a .box file. This .box file should contain the coordinates of each box that contains a character.


    m 861 2621 893 2644 0
    m 895 2617 921 2648 0
    e 922 2621 938 2644 0
    m 939 2622 961 2644 0
    b 963 2621 978 2649 0

This corresponds to the following image:

Incorrect boxes definition

As you can see, the problem here is that the character pairs "se" and "pt" are each recognized as a single m, so both characters of a pair end up in the same box.

I should have this:

    s 861 2621 877 2644 0
    e 877 2621 893 2644 0
    p 895 2617 908 2648 0
    t 908 2617 921 2648 0
    e 922 2621 938 2644 0
    m 939 2622 961 2644 0
    b 963 2621 978 2649 0

This corresponds to the almost correct image here:

Split boxes

As you can see, each letter is inside its own box.

Here t has the same bottom edge as p because I edited the .box file and split the original box that contained both characters.
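The manual split described above can be sketched in a few lines of Python. This is only an illustration: split_box is a hypothetical helper, and it assumes that the merged characters have equal widths, which is why the result still needs the vertical corrections shown next.

```python
from typing import List, Tuple

# (char, left, bottom, right, top, page), as in a .box line
Box = Tuple[str, int, int, int, int, int]

def split_box(box: Box, chars: str) -> List[Box]:
    """Split one box into len(chars) boxes of equal width (hypothetical helper)."""
    _, left, bottom, right, top, page = box
    n = len(chars)
    # Vertical cut positions, evenly spaced between the left and right edges.
    cuts = [left + round(i * (right - left) / n) for i in range(n + 1)]
    return [(c, cuts[i], bottom, cuts[i + 1], top, page)
            for i, c in enumerate(chars)]

# The first merged box from the Tesseract output, split into "s" and "e".
for b in split_box(("m", 861, 2621, 893, 2644, 0), "se"):
    print(*b)
```

Applied to the two merged m boxes, this reproduces the four s, e, p and t lines of the split file above.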

I have to manually correct the coordinates to get this:

    s 861 2621 877 2644 0
    e 877 2621 893 2644 0
    p 895 2617 910 2644 0
    t 910 2622 921 2648 0
    e 922 2621 938 2644 0
    m 939 2622 961 2644 0
    b 963 2621 978 2649 0

This matches the following image:

Boxes close to letter edges

So I have to manually correct the whole training set (all 4700+ characters!) using dedicated software (jTessBoxEditor), which is very slow and painful.
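Since the training text is known in advance, the place where the generated boxes first go wrong can at least be located automatically. The following Python sketch is an assumption-laden illustration, not part of any Tesseract tool: box_chars and first_mismatch are hypothetical names, and it assumes that the box file lists the characters in reading order with no boxes for whitespace.

```python
from typing import List, Optional

def box_chars(box_text: str) -> List[str]:
    """Return the character (first field) of every non-empty box line."""
    return [line.split()[0] for line in box_text.splitlines() if line.strip()]

def first_mismatch(training_text: str, box_text: str) -> Optional[int]:
    """Index of the first box whose character differs from the training text.

    Returns None when the box characters match the text exactly.
    """
    expected = [c for c in training_text if not c.isspace()]
    actual = box_chars(box_text)
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # One sequence is a prefix of the other: report where it ends.
        return min(len(expected), len(actual))
    return None

generated = """m 861 2621 893 2644 0
m 895 2617 921 2648 0
e 922 2621 938 2644 0
m 939 2622 961 2644 0
b 963 2621 978 2649 0"""
print(first_mismatch("septemb", generated))
```

For the merged boxes above, the check fails at the very first box, because an m stands where an s is expected.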

Box file format

Each line describes one character. The file must be in UTF-8, but a file generated in another encoding should be easy to convert to UTF-8.

Each line gives a character and its position in the image file, measured from the bottom-left corner (0, 0).

e 922 2621 938 2644 0 gives the lower-left coordinates (x=922 and y=2621) and upper-right coordinates (x=938 and y=2644) of the box that contains an e character.

The last number on each line is the page number (0-based): here it is on the first page (0).

More information can be found here.
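To make the format concrete, here is a minimal Python sketch that parses one box line and converts it to the top-left-origin rectangle used by most image libraries. The names Box, parse_box_line and to_top_left are hypothetical, and the image height of 3508 pixels (an A4 page at 300 ppi) is only an assumed value for the example.

```python
from typing import NamedTuple, Tuple

class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int

def parse_box_line(line: str) -> Box:
    """Parse a line such as 'e 922 2621 938 2644 0'."""
    char, *numbers = line.split()
    left, bottom, right, top, page = map(int, numbers)
    return Box(char, left, bottom, right, top, page)

def to_top_left(box: Box, image_height: int) -> Tuple[int, int, int, int]:
    """Return (x, y, width, height) with the origin at the top-left corner."""
    return (box.left, image_height - box.top,
            box.right - box.left, box.top - box.bottom)

box = parse_box_line("e 922 2621 938 2644 0")
print(box)
print(to_top_left(box, image_height=3508))  # 3508 is an assumed page height
```

The vertical flip is needed because box files count y upwards from the bottom edge, while image coordinates usually count downwards from the top.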

Why not use LaTeX?

The problem is that OpenOffice only generates a PDF (image), and I don't know how to parse a PDF file to get the position of each character (or whether that is even possible).

I then remembered that LaTeX uses boxes to place each character on the page, so LaTeX "knows" where each box is. I went back to the discussion mentioned above and decided to give it a try.

I read in that discussion that they used the following process so I tried to do the same:

a) You install MiKTeX and a package for it that allows LaTeX to understand the Kannada language; as far as I know there is a package called itrans that works with this. However, you need to provide its input using Latin transliteration.

b) You prepare the training text (using transliteration) and process it with itrans and LaTeX. At this stage you get a .dvi file that is typeset in Kannada and contains (in a cryptic form) all the information about the character boxes of your text. You extract this information using my Perl script:

    dvitype <file.dvi> | perl script1.pl > file.texbox

c) You produce the training image for Tesseract. You can do this electronically using dvips + Ghostscript:

    dvips -o <ps.file> <dvi.file>
    gs -r300x300 -dNOPAUSE -sOutputFile=<file.tif> -sDEVICE=tiffg4 <ps.file>

or by printing the DVI file on a printer and then scanning it.

You run

    tesseract <file.tif> <file.txt> batch.nochop makebox

and rename file.txt to file.box.

d) You produce the final box file for training Tesseract using my second Perl script:

    perl correlatebox.pl file.texbox file.box > result_file.box

Unfortunately, the .pl files are currently missing, and I am not an expert on DVI files or font files.

What I tried

My idea is to produce a DVI file that can be parsed (using dvitype) to get the coordinates of the boxes, and then to generate an image from the DVI.

But when I use lualatex -output-format dvi training.helveticacomp.tex, I get some errors with the font (HelveticaComp.ttf).

And when I use xelatex -no-pdf training.helveticacomp.tex, I get an .xdv file that I cannot parse with the dvitype utility.

I installed the missing components on my basic TeX Live distribution: dvitype from texware and disdvi from dtl (the latter to disassemble .xdv files with disdvi -x <xdv file>).

I found interesting information in :

I tried to follow this document to install the font I need (because lualatex complained about a missing font), but I still get some errors. I am also not sure I want to install a completely new font into my LaTeX setup, because it seems difficult, so I have not tried that yet.

Here is an example of my .tex file:



‘ kit Contacts Carte Type un forme ç~ avant BYW: EN monde 2001 qu'on plan image ZG 23 À+ niveau femmes

My question

I need some pointers on how to generate a correct .box file, together with the matching .png (or .tif) file, using LaTeX.

If possible, I would like to avoid installing a new font.

Any help would be appreciated on how to parse the xdv or dvi file to obtain the .box and how to correctly manage my font.

Any other solution that produces the same output (.box + image file) is of course welcome.

I suppose this would interest anyone who wants to train Tesseract without spending hours with .box editing.

My computer

I use a basic TeX Live installation on Mac OS X (10.10).

Best Answer

So I investigated the option of using LuaTeX's node-processing callbacks. The best suited one is pre_output_filter, which is called when a page is ready for output. I've created a simple package, named boxes, which consists of two files: the LaTeX package boxes.sty and the Lua module boxes.lua.


  main_language = "\boxes@lang"
  resolution    = tonumber("\boxes@resolution")
  startx        = tex.sp("\boxes@startx")
  starty        = tex.sp("\boxes@starty")

  print("language", main_language)
  local boxes = require "boxes"
  boxes.resolution = resolution
  boxes.startx = startx
  boxes.starty = starty
  luatexbase.add_to_callback("pre_output_filter",
    function(head, info, size, pack, maxdpth)
      local f = font.getfont(font.current())
      local fontname = f.psname or f.fullname
      local name = string.format("%s.%s.exp%i.box", main_language, fontname, 0)
      local glyphs = boxes.traverse(head)
      if #glyphs > 0 then boxes.save(name, glyphs) end
      return head
    end, "Save node boxes")

This file is simple; the package options are the important part:

  • lang: the language being processed
  • resolution: I haven't found out at which resolution the box file should be saved, so I think it is a good idea to make it configurable. The default is 72 ppi.
  • startx and starty: I can't find a way to calculate the top-left corner of the text block. It really depends on the document class used and on packages like geometry. I hope there is some way to determine it, but I don't know how, so these values must be set by hand, with some experimenting.

And now the Lua module, boxes.lua:

local boxes = {}
boxes.resolution = 300 --72
local pt = 2 ^ 16
local uchar = unicode.utf8.char
local total_height = tex.pageheight
local pagebox = tex.pdfpagebox
local baselineskip = tex.baselineskip.width
-- look up node ids by name, because the numeric ids differ between LuaTeX versions
local hlist_id = node.id("hlist")
local glyph_id = node.id("glyph")

local function round(num, idp)
  local mult = 10^(idp or 0)
  return math.floor(num * mult + 0.5) / mult
end

-- x and y are in scaled points, measured from the top left of the page;
-- the result is in pixels, measured from the bottom left
local function make_dimensions(glyph, x, y)
  local resolution = boxes.resolution
  local bp = (2 ^ 16) / (resolution / 72.27)
  local lx = round(x / bp)
  local ly = round((total_height - (y + glyph.depth)) / bp)
  local rx = round((x + glyph.width) / bp)
  local ry = round((total_height - (y - glyph.height)) / bp)
  return lx, ly, rx, ry
end

function boxes.traverse(head)
  local glyphs = {}
  local i = 0
  for n in node.traverse(head) do
    -- every hlist on the page is treated as one line of text
    if n.id == hlist_id then
      i = i + 1
      local set = n.glue_set
      local sign = n.glue_sign
      local order = n.glue_order
      local height = i * baselineskip
      local nhead = n.head
      -- y is distance from page top to the current baseline
      local y = (boxes.starty or (tex.pdfvorigin - 4.5 * pt)) + height
      local x = boxes.startx or (tex.pageleftoffset + 2.5 * pt)
      for glyph in node.traverse_id(glyph_id, nhead) do
        -- w is the width from the start of the line up to this glyph
        local w, h, d = node.dimensions(set, sign, order, nhead, glyph)
        local glyph_x = x + w
        local lx, ly, rx, ry = make_dimensions(glyph, glyph_x, y)
        glyphs[#glyphs+1] = {uchar(glyph.char), lx, ly, rx, ry}
      end
    end
  end
  return glyphs
end

function boxes.save(name, glyphs)
  local f = io.open(name, "w")
  for _, line in ipairs(glyphs) do
    f:write(table.concat(line, ", ") .. "\n")
  end
  f:close()
end

return boxes

The code is really simple: in the function boxes.traverse we process the list of line nodes. When we find a node with id 0, which is a horizontal list (one line of text), we increase the line count and advance the vertical position by \baselineskip. This works as long as the text is simple, without more advanced formatting that would cause vertical space bigger than \baselineskip. For this specific purpose we may assume that only plain text without formatting is used.

We then process the child list for glyph nodes and calculate the horizontal position with the node.dimensions function:

local w, h, d = node.dimensions(set, sign, order, nhead, glyph)

set, sign and order are used to calculate the size of the spaces: because a space has variable width, it may be slightly different on each line. These values are set in the parent hlist node. The w variable is the width from the beginning of the line up to the current glyph.

Then we calculate the dimensions of the character with the function

local function make_dimensions(glyph, x, y)
  local resolution = boxes.resolution
  local bp = (2 ^ 16) / (resolution / 72.27)
  local lx = round(x / bp)
  local ly = round((total_height - (y + glyph.depth)) / bp)
  local rx = round((x + glyph.width) / bp)
  local ry = round((total_height - (y - glyph.height)) / bp)
  return lx, ly, rx, ry
end

The variables x and y give the left edge of the glyph and the position of its baseline; the glyph's height and depth are then added to get the top and bottom edges. Because the coordinate system in TeX begins at the top left, while the box format starts at the bottom left, all vertical dimensions must be mirrored, simply by subtracting the calculated y value from the total height of the page. Finally, the calculated dimensions are converted to the current resolution by dividing by the bp variable.
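The same arithmetic can be checked outside LuaTeX with a small Python sketch. It mirrors make_dimensions under the same assumptions (65536 scaled points per TeX point, 72.27 TeX points per inch); the function names sp_to_px and glyph_box, and the page and glyph sizes in the example, are made up for illustration.

```python
import math

SP_PER_PT = 2 ** 16   # scaled points per TeX point
PT_PER_INCH = 72.27   # TeX points per inch

def sp_to_px(sp: float, resolution: int) -> int:
    """Convert scaled points to pixels at the given resolution (ppi)."""
    return math.floor(sp / SP_PER_PT * resolution / PT_PER_INCH + 0.5)

def glyph_box(x, y, width, height, depth, page_height, resolution):
    """x: left edge, y: baseline, both in sp from the top left of the page.

    Returns (left, bottom, right, top) in pixels from the bottom left.
    """
    lx = sp_to_px(x, resolution)
    ly = sp_to_px(page_height - (y + depth), resolution)
    rx = sp_to_px(x + width, resolution)
    ry = sp_to_px(page_height - (y - height), resolution)
    return lx, ly, rx, ry

inch = PT_PER_INCH * SP_PER_PT  # one inch in scaled points
# Made-up glyph: 0.1 in wide and high, no depth, 1 in from the left edge,
# baseline 1 in below the top of a 10 in page, rendered at 300 ppi.
print(glyph_box(x=inch, y=inch, width=0.1 * inch, height=0.1 * inch,
                depth=0, page_height=10 * inch, resolution=300))
```

With these numbers the bottom edge lands 9 inches above the bottom of the page, which shows the mirroring: a baseline 1 inch below the top becomes y = 2700 pixels at 300 ppi.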


‘ kit Contacts Carte Type un forme ç~ avant BYW: EN monde 2001 qu'on plan image ZG 23 À+ niveau femmes příliš žluťoučký kůň úpěl ďábelské ódy. 

We can visualize the boxes with the tessboxes command. It needs an image in PBM format, which can be created with the pdftoppm command:

lualatex sample
pdftoppm -mono -freetype yes -aa yes -r 300  sample.pdf > sample.pbm
tessboxes sample.pbm eng.LMRoman12-Regular.exp0.box > output.pbm

