[Tex/LaTex] Training Tesseract with generated LaTeX files

dvi, fonts, fontspec, xdvi

My problem

Tesseract is an OCR that can be trained.

I need to train it to take into account a new specific font (which is a ttf font).

I found this discussion that indicates it is possible to generate training files with LaTeX. However, the script files are no longer available.

So I started writing my own scripts, but I am now stuck on generating the right files with LaTeX.

How a Tesseract training set is made

To make a Tesseract training set, I need to generate two files: a .box file and an image file (e.g. .tif or .png).

Both files are generated from a training text that I have extracted from an existing Tesseract training set. This training text contains at least 4700 characters.

Here is an example of training text:

    « : : :R: : :» Re: û® 03 vue ë ’ Logiciels septembre ¢§ petite numérique ÈQ été Yû Windows “I” ^Z Commentaires forum privé
    depuis Dernière Annuaire une vendeur ° # “au-delà ” est français FAQ «% moment car œ? | ` fois JuG 14 plus HP Culture

For now, using this text, I generate a PDF file (using OpenOffice). This PDF is then converted to an image (using ImageMagick).

I then use tesseract to generate a .box file. This .box file should contain the coordinates of each box that contains a character.

Example:

    m 861 2621 893 2644 0
    m 895 2617 921 2648 0
    e 922 2621 938 2644 0
    m 939 2622 961 2644 0
    b 963 2621 978 2649 0

This corresponds to the following image:

[Image: incorrect box definitions]

As you can see, the problem is that the letter pairs "se" and "pt" are each recognized as a single m, so both letters of a pair end up in the same box.

I should have this:

    s 861 2621 877 2644 0
    e 877 2621 893 2644 0
    p 895 2617 908 2648 0
    t 908 2617 921 2648 0
    e 922 2621 938 2644 0
    m 939 2622 961 2644 0
    b 963 2621 978 2649 0

This corresponds to the almost correct image here:

[Image: split boxes]

As you can see, each letter is now inside its own box.

The t has the same bottom as the p here because I edited the .box file and split the original box that contained both characters.
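The equal-width split used for these boxes can be automated when the correct characters of a merged box are known. The following is a minimal sketch in Python; `split_box` is a hypothetical helper, not part of Tesseract or jTessBoxEditor, and it assumes the merged characters are all roughly the same width.

```python
def split_box(line, chars):
    """Split one .box line into len(chars) boxes of equal width.

    line  -- a box line such as "m 861 2621 893 2644 0"
    chars -- the characters that the merged box really contains
    """
    # The first field is the (wrong) character; the last five are numbers.
    _, left, bottom, right, top, page = line.rsplit(" ", 5)
    left, right = int(left), int(right)
    width = right - left
    n = len(chars)
    # Interior edges are shared between neighbouring boxes.
    edges = [left + round(i * width / n) for i in range(n + 1)]
    return [
        f"{c} {edges[i]} {bottom} {edges[i + 1]} {top} {page}"
        for i, c in enumerate(chars)
    ]


for fixed in split_box("m 861 2621 893 2644 0", "se"):
    print(fixed)
```

The vertical coordinates are copied from the merged box unchanged, so they still need adjusting by hand.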

I then have to correct the coordinates manually to get this:

    s 861 2621 877 2644 0
    e 877 2621 893 2644 0
    p 895 2617 910 2644 0
    t 910 2622 921 2648 0
    e 922 2621 938 2644 0
    m 939 2622 961 2644 0
    b 963 2621 978 2649 0

This matches the following image:

[Image: boxes close to the letter edges]

So I have to correct the whole training set manually (all 4700+ characters!) using dedicated software (jTessBoxEditor), which is very slow and painful.

Box file format

Each line describes one character. The file must be in UTF-8, but a file generated in another encoding should be easy to convert to UTF-8.

Each line gives a character and its position in the image file, measured from the bottom-left corner (0; 0).

e 922 2621 938 2644 0 gives the lower-left coordinates (x=922 and y=2621) and upper-right coordinates (x=938 and y=2644) of the box that contains an e character.

The last number on each line is the 0-based page number: here the character is on the first page (0).

More information can be found here.
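To make the format concrete, here is a minimal sketch in Python that parses one box line and converts it to the top-left-origin rectangle that most image libraries expect. The names `Box`, `parse_box_line` and `to_image_rect` are hypothetical, and the image height of 3508 pixels is only an assumption (roughly an A4 page at 300 ppi).

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one line of a .box file into a Box."""
    # Split from the right so the character itself is never split.
    char, *numbers = line.rstrip("\n").rsplit(" ", 5)
    return Box(char, *map(int, numbers))


def to_image_rect(box, image_height):
    """Return (left, upper, right, lower) measured from the top-left corner."""
    return (box.left, image_height - box.top,
            box.right, image_height - box.bottom)


box = parse_box_line("e 922 2621 938 2644 0")
print(box)
print(to_image_rect(box, image_height=3508))  # assumed image height
```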

Why not use LaTeX?

The problem is that OpenOffice only generates a PDF (image), and I don't know how to parse a PDF file to get the position of each character (or whether that is even possible).

I then remembered that LaTeX uses boxes to place each character on a page, so LaTeX "knows" where each box is. So I found the discussion mentioned above and decided to give it a try.

I read in that discussion that they used the following process so I tried to do the same:

a) You install MikTeX and some package for it that allows latex to understand kannada language, as I know there is package called itrans that work with this. However you need to provide input for it using latin transliteration.

b) You prepare training text (using transliteration) and process it with itrans and latex. You get at this stage a .dvi file that is typeseted in kannada language and contains (in a cryptic form) all the information about the character boxes of your text. You extract this information using my perl script

dvitype <file.dvi>| perl script1.pl > file.texbox

c) You produce training image for tesseract. You can do this electronically using dvips + ghostscript

dvips -o <ps.file> <dvi.file>

gs -r300x300 -dNOPAUSE -sOutputFile=<file.tif> -sDEVICE=tiffg4 <ps.file>

or by printing dvi file on printer and then scanning it.

You run tesseract <file.tif> <file.txt> batch.nochop makebox

You rename file.txt into file.box.

d) You produce final box file for training tesseract using my second perl script

perl correlatebox.pl file.texbox file.box > result_file.box

Unfortunately, the .pl files are missing for now… and I am not an expert in DVI files or font files.

What I tried

My guess is to output a parsable DVI file (using dvitype) to get the boxes' coordinates and then generate an image from the DVI.

When I use lualatex -output-format dvi training.helveticacomp.tex, I get errors about the font (HelveticaComp.ttf).

When I use xelatex -no-pdf training.helveticacomp.tex instead, I get an xdv file that I cannot parse with the dvitype utility.

I installed the missing components on my basic TeX Live distribution: dvitype from texware and disdvi from dtl (the latter disassembles xdv files with disdvi -x <xdv file>).

I found interesting information in :

I tried to follow this document to install my needed font (because lualatex complained about a missing font) but I still have some errors. And I am not sure I want to install a complete new font into my LaTeX because it seems difficult, so I did not try it yet.

Here is an example of my .tex file:

\documentclass[fontsize=12,a4paper,headheight=0.5cm,headsepline,parskip=half-]{scrartcl}

\KOMAoptions{BCOR=0mm,DIV=40}

\usepackage{fontspec}
\setmainfont{Helvetica-Compressed}
\begin{document}
‘ kit Contacts Carte Type un forme ç~ avant BYW: EN monde 2001 qu'on plan image ZG 23 À+ niveau femmes
\end{document}

My question

I need some leads to generate a correct .box file associated with a .png (or .tif) file using LaTeX.

If possible, the solution should avoid installing a new font.

Any help would be appreciated on how to parse the xdv or dvi file to obtain the .box and how to correctly manage my font.

Any other solution that produces the same output (.box + image file) is of course welcome.

I suppose this would interest anyone who wants to train Tesseract without spending hours with .box editing.

My computer

I use a basic TeX Live installation on Mac OS X (10.10).

Best Answer

So I investigated the option of using LuaTeX's node-processing callbacks. The best suited is pre_output_filter, which is called when a page is ready for output. I've created a simple package named boxes, which consists of two files: the LaTeX package boxes.sty and the Lua module boxes.lua.

boxes.sty:

\ProvidesPackage{boxes}
\RequirePackage{luacode}
\RequirePackage{kvoptions}
\DeclareStringOption[eng]{lang}
\DeclareStringOption[72]{resolution}
\DeclareStringOption[75pt]{startx}
\DeclareStringOption[67.5pt]{starty}
\ProcessKeyvalOptions*
\luaexec{%
  main_language = "\boxes@lang"
  resolution    = tonumber("\boxes@resolution")
  startx        = tex.sp("\boxes@startx")
  starty        = tex.sp("\boxes@starty")
}

\begin{luacode*}
  print("language", main_language)
  local boxes = require "boxes"
  boxes.resolution = resolution
  boxes.startx = startx
  boxes.starty = starty
  --boxes.set_name()
  luatexbase.add_to_callback("pre_output_filter",
    function(head, info, size, pack, maxdpth)
      local f = font.getfont(font.current())
      local fontname = f.psname or f.fullname
      local name = string.format("%s.%s.exp%i.box", main_language, fontname, 0)
      local glyphs = boxes.traverse(head)
      if #glyphs > 0 then boxes.save(name, glyphs) end
      return head
    end, "Save node boxes")
\end{luacode*}
\endinput

This file is simple. The package options are worth noting:

lang: the language being processed.
resolution: I couldn't find out at which resolution the box file should be saved, so I made it configurable. The default is 72 ppi.
startx and starty: I couldn't find a way to calculate the top-left corner of the text block. It depends on the document class and on packages such as geometry. There may be a way to determine it, but I don't know it, so these values must be set by hand, with some experimenting.

And now the Lua module, boxes.lua:

local boxes = {}
boxes.resolution =  300 --72
local pt = 2 ^ 16
local uchar = unicode.utf8.char
local total_height = tex.pageheight
local pagebox = tex.pdfpagebox
local baselineskip = tex.baselineskip.width

local function round(num, idp)
  local mult = 10^(idp or 0)
  return math.floor(num * mult + 0.5) / mult
end


local function make_dimensions(glyph, x, y)
  local resolution = boxes.resolution
  local bp = (2 ^ 16) / (resolution / 72.27)
  local lx = round(x / bp)
  local ly = round((total_height - (y + glyph.depth)) / bp)
  local rx = round((x + glyph.width) / bp)
  local ry = round((total_height - (y - glyph.height)) / bp)
  return lx, ly, rx, ry
end

-- look up the glyph node id by name; it was 37 in the LuaTeX version
-- used when this was written, but the number differs between versions
local glyph_id = node.id("glyph")

function boxes.traverse(head)
  --local set = head.glue_set
  --local sign = head.glue_sign
  --local order = head.glue_order
  local resolution = boxes.resolution
  local bp = (2 ^ 16) / (resolution / 72.27)
  local glyphs = {}
  local i = 0
  for n in node.traverse(head) do
    print(n.id, n.subtype)
    if n.id == 0 then
      i = i + 1
      local set = n.glue_set
      local sign = n.glue_sign
      local order = n.glue_order
      local height = i * baselineskip
      local nhead = n.head
      -- y is distance from page top to the current baseline
      local y = boxes.starty + height or tex.pdfvorigin + height - 4.5 * (2^16)
      local x = boxes.startx or tex.pageleftoffset + 2.5 * (2^16)
      for glyph in node.traverse_id(glyph_id, nhead) do
        local w, h, d = node.dimensions(set, sign, order, nhead, glyph)
        local glyph_x = x + w
        local lx, ly, rx, ry = make_dimensions(glyph, glyph_x, y)
        glyphs[#glyphs+1] = {uchar(glyph.char), lx, ly, rx, ry}
      end
    end
  end
  return glyphs
end

function boxes.save(name, glyphs)
    local f = io.open(name,"w")
    for _, line in ipairs(glyphs) do
        f:write(table.concat(line,", ").. "\n")
    end
    f:close()
end

return boxes

The code is really simple: in the function boxes.traverse we process the list of line nodes. When we find a node with id 0, which is a horizontal list (one line of text), we increase the line count and advance the vertical position by \baselineskip. This works as long as the text is simple, without more advanced formatting that would cause vertical spaces bigger than \baselineskip. For this specific purpose we may assume that only plain text without formatting is used.

We then process the child list for glyph nodes and calculate the horizontal position with the node.dimensions function:

local w, h, d = node.dimensions(set, sign, order, nhead, glyph)

set, sign and order are used to calculate the size of the spaces: because spaces have variable width, they may be a little different on each line. These values are set in the parent hlist node. The w variable is the width from the beginning of the line up to the current glyph.

Then we calculate the dimensions of the character with the function

local function make_dimensions(glyph, x, y)
  local resolution = boxes.resolution
  local bp = (2 ^ 16) / (resolution / 72.27)
  local lx = round(x / bp)
  local ly = round((total_height - (y + glyph.depth)) / bp)
  local rx = round((x + glyph.width) / bp)
  local ry = round((total_height - (y - glyph.height)) / bp)
  return lx, ly, rx, ry
end

The variable x is the left edge of the glyph and y is the distance from the top of the page to the current baseline. Because the coordinate system in TeX begins at the top left, but the box format starts at the bottom left, all vertical dimensions must be mirrored, simply by subtracting the calculated y value from the total height of the page. At the end, the calculated dimensions are scaled to the current resolution by dividing by the bp variable.
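The arithmetic can be checked outside TeX. Below is a minimal Python sketch that mirrors make_dimensions, with all lengths given in scaled points (sp). The function and parameter names are hypothetical, and the sample page (10 inches high, glyph 1 inch from the left and 2 inches from the top) is made up purely to give round numbers.

```python
import math

SP_PER_PT = 2 ** 16      # scaled points per TeX point
PT_PER_INCH = 72.27      # TeX points per inch


def lua_round(value):
    """Round half up, like math.floor(v + 0.5) in the Lua code."""
    return math.floor(value + 0.5)


def glyph_box(x, y, width, height, depth, page_height, resolution):
    """Mirror of make_dimensions: lengths in sp, result in pixels.

    x is the left edge of the glyph, y the distance from the top of
    the page down to the baseline.
    """
    # Called bp in the Lua code; it is the number of sp per pixel.
    sp_per_pixel = SP_PER_PT / (resolution / PT_PER_INCH)
    lx = lua_round(x / sp_per_pixel)
    ly = lua_round((page_height - (y + depth)) / sp_per_pixel)
    rx = lua_round((x + width) / sp_per_pixel)
    ry = lua_round((page_height - (y - height)) / sp_per_pixel)
    return lx, ly, rx, ry


inch = SP_PER_PT * PT_PER_INCH  # one inch expressed in sp
print(glyph_box(x=1 * inch, y=2 * inch, width=0.1 * inch,
                height=0.1 * inch, depth=0.0,
                page_height=10 * inch, resolution=300))
```

Note that the y mirroring puts the glyph's depth below the baseline (smaller y in the box file) and its height above it.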

\documentclass[fontsize=12,a4paper,%headheight=0.5cm,headsepline,
parskip=half-]{scrartcl}
%\documentclass{article}
\usepackage[resolution=300]{boxes}
%\KOMAoptions{BCOR=0mm,DIV=40}

\usepackage{fontspec}
%\setmainfont{Helvetica-Compressed}
\begin{document}
\typeout{\the\baselineskip}
‘ kit Contacts Carte Type un forme ç~ avant BYW: EN monde 2001 qu'on plan image ZG 23 À+ niveau femmes příliš žluťoučký kůň úpěl ďábelské ódy. 
\end{document}

We can visualize the boxes with the tessboxes command. It needs an image in PBM format, which can be created with the pdftoppm command:

lualatex sample
pdftoppm -mono -freetype yes -aa yes -r 300  sample.pdf > sample.pbm
tessboxes sample.pbm eng.LMRoman12-Regular.exp0.box > output.pbm

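boxes.save joins the fields with ", " and writes no page number, whereas the box format described in the question is space-separated and ends with a page number. If the training tools need that form, a small conversion step could bridge the gap. A minimal sketch in Python, assuming one glyph per line in the form char, lx, ly, rx, ry and that everything is on page 0; `to_tesseract_line` and `convert` are hypothetical helper names:

```python
def to_tesseract_line(raw, page=0):
    """Turn 'e, 922, 2621, 938, 2644' into 'e 922 2621 938 2644 0'."""
    # Split from the right so that a comma glyph (",, 1, 2, 3, 4") survives.
    char, *coords = raw.rstrip("\n").rsplit(", ", 4)
    # Some Lua versions may print 922.0 instead of 922, so go through float.
    numbers = [str(int(float(c))) for c in coords]
    return " ".join([char, *numbers, str(page)])


def convert(lines, page=0):
    """Convert every non-empty line written by boxes.save."""
    return [to_tesseract_line(line, page) for line in lines if line.strip()]


sample = ["e, 922, 2621, 938, 2644\n", ",, 10.0, 20.0, 30.0, 40.0\n"]
for line in convert(sample):
    print(line)
```

In practice the lines would be read from the generated file, for example eng.LMRoman12-Regular.exp0.box, and written back as UTF-8.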

Related Solutions

[Tex/LaTex] How to get the .dvi file from Latex in addition to or instead of a pdf

This answers the question as asked, though it turned out that the real aim was to produce raster images for a special Braille letter printer.

In general, to produce a DVI file, compile the TeX source <filename>.tex with LaTeX:

latex <filename>

or, for LuaLaTeX, with its DVI-producing variant:

dvilualatex <filename>

Depending on the contents of the TeX source, more options may be necessary (for example --src-specials).

TeXworks has no typesetting profiles for these commands, though, so they would have to be added manually.

The DVI file can then be converted to PDF in several ways. It is not deleted automatically by default, so unless you run a cleanup routine afterwards, you keep both the DVI and the PDF file.

Possible ways to convert DVI to PDF:

produce a PostScript file with dvips, then convert this PS file to PDF with ps2pdf
convert directly with dvipdfm or dvipdfmx

[Tex/LaTex] How to create non-outlined SVG files from LaTeX formulae

I take it upon myself to answer this question, based on Martin's comments and my own research.

Yes, the conversion from .dvi to non-outlined .svg is feasible. The best tool for the job is dvisvgm by Martin Gieseking, but it works best with XeTeX. Running the XeTeX-generated .xdv file through dvisvgm, one obtains an .svg file with embedded fonts. By deleting the preamble specifying the embedded font, and properly renaming the fonts within the .svg file, one obtains the desired result.

Except for one thing: XeTeX utilizes some glyphs which are not mapped to unicode characters directly. Specifically, this applies to big operators, which have different glyphs for \displaystyle and \textstyle. The \displaystyle glyphs are "hidden" within the font. In theory, it is possible to access these glyphs from SVG using, e.g., the <glyphRef> tag. But almost no major browsers support this feature.

The simplest and safest solution to this problem seems to be to edit the font file, and give an explicit unicode mapping to the display style glyphs. This way, the .svg file given by dvisvgm can be used with the modified font to display math equations on the web.

Sample python script for mapping unencoded glyphs to the PUA area starting with 0xF0000, using the FontTools/TTX library:

from fontTools import ttLib

fontFile = "C:\\Windows\\Fonts\\xits-math.otf"
outFile = "C:\\Windows\\Fonts\\xits-mod-math.otf"
font = ttLib.TTFont(fontFile,
                    allowVID=False,
                    checkChecksums=False,
                    recalcBBoxes=False,
                    recalcTimestamp=True,
                    lazy=True)

font['cmap']  # Load the cmap table into font.tables
all_glyphs = font.getGlyphOrder()
for i, subtable in enumerate(font.tables['cmap'].tables):
    if subtable.format == 12:
        # Glyphs that already have a code point in this subtable
        encoded_glyphs = set(subtable.cmap.values())
        unencoded_glyphs = [g for g in all_glyphs if g not in encoded_glyphs]
        # Assign consecutive private-use code points starting at U+F0000
        charcodes = range(0xF0000, 0xF0000 + len(unencoded_glyphs))
        new_cmap = dict(zip(charcodes, unencoded_glyphs))
        font.tables['cmap'].tables[i].cmap.update(new_cmap)

font.save(outFile, False, False)

Beware that the script overwrites any existing mappings in the PUA area. A more complicated script can take care of this as well. Also, mappings are only added to cmap format 12 subtables; it would probably make sense to add mappings to formats 10 and 8 as well, if they are present in the font.
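One way to avoid overwriting existing PUA mappings is to skip code points that the subtable already uses. The following is only a sketch of that idea, working on plain dictionaries shaped like subtable.cmap (code point to glyph name); `assign_free_codepoints` is a hypothetical helper and has not been tested against a real font.

```python
def assign_free_codepoints(existing_cmap, glyph_names,
                           start=0xF0000, end=0xFFFFD):
    """Map each glyph name to an unused code point in [start, end].

    existing_cmap -- dict {code point: glyph name}, like subtable.cmap
    glyph_names   -- the unencoded glyphs that need a code point
    """
    new_cmap = {}
    code = start
    for name in glyph_names:
        while code in existing_cmap:
            code += 1  # skip code points the font already maps
        if code > end:
            raise ValueError("ran out of private-use code points")
        new_cmap[code] = name
        code += 1
    return new_cmap


existing = {0xF0000: "already.mapped", 0xF0002: "also.mapped"}
print(assign_free_codepoints(existing, ["sum.display", "int.display"]))
```

The result could then be passed to cmap.update in place of the dict(zip(...)) mapping. The default range is Supplementary Private Use Area-A (U+F0000 to U+FFFFD).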
