My problem
Tesseract is an OCR engine that can be trained.
I need to train it to recognize a new, specific font (a TTF font).
I found this discussion, which indicates that it is possible to generate training files with LaTeX. However, the script files are no longer available.
So I started writing my own scripts, but I am now stuck on generating the right files with LaTeX.
How a Tesseract training set is made
To make a Tesseract training set, I need to generate two files: a .box file and an image file (e.g. .tif or .png).
Both files are generated from a training text that I have extracted from an existing Tesseract training set. This training text contains at least 4700 characters.
Here is an example of training text:
« : : :R: : :» Re: û® 03 vue ë ’ Logiciels septembre ¢§ petite numérique ÈQ été Yû Windows “I” ^Z Commentaires forum privé
depuis Dernière Annuaire une vendeur ° # “au-delà ” est français FAQ «% moment car œ? | ` fois JuG 14 plus HP Culture
For now, I use this text to generate a PDF file (with OpenOffice). The PDF is then converted to an image (with ImageMagick).
I then use tesseract to generate a .box file. This .box file should contain the coordinates of each box that contains a character.
Example:
m 861 2621 893 2644 0
m 895 2617 921 2648 0
e 922 2621 938 2644 0
m 939 2622 961 2644 0
b 963 2621 978 2649 0
Which corresponds to the following image:
As you can see, the problem here is that the se and pt character pairs are each replaced by an m, and that both characters of a pair end up in the same box.
I should have this:
s 861 2621 877 2644 0
e 877 2621 893 2644 0
p 895 2617 908 2648 0
t 908 2617 921 2648 0
e 922 2621 938 2644 0
m 939 2622 961 2644 0
b 963 2621 978 2649 0
Which corresponds to this almost correct image:
As you can see, each letter is inside its own box. t has the same bottom as p here because I edited the .box file and split the original box that contained both characters.
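The even split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: split_box is a hypothetical helper (it is not part of Tesseract or jTessBoxEditor), and it assumes that all characters in the merged box have the same width, which is why the result still needs manual correction afterwards.

```python
def split_box(line, chars):
    """Split one .box line into len(chars) boxes of equal width.

    Hypothetical helper: the vertical extent is copied unchanged,
    so every resulting box keeps the bottom and top of the original.
    """
    _, left, bottom, right, top, page = line.split()
    left, bottom, right, top = int(left), int(bottom), int(right), int(top)
    width = right - left
    count = len(chars)
    boxes = []
    for i, char in enumerate(chars):
        # Integer arithmetic keeps neighbouring boxes edge to edge.
        x0 = left + width * i // count
        x1 = left + width * (i + 1) // count
        boxes.append(f"{char} {x0} {bottom} {x1} {top} {page}")
    return boxes


for box in split_box("m 861 2621 893 2644 0", "se"):
    print(box)
```

Applied to the two wrong m boxes from the first example, this gives the intermediate s, e, p and t lines shown above.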
I have to manually correct the coordinates to get this:
s 861 2621 877 2644 0
e 877 2621 893 2644 0
p 895 2617 910 2644 0
t 910 2622 921 2648 0
e 922 2621 938 2644 0
m 939 2622 961 2644 0
b 963 2621 978 2649 0
Which matches the following image:
So I have to manually correct the whole training set (all 4700+ characters!) with dedicated software (jTessBoxEditor), which is very slow and painful.
Box file format
Each line corresponds to one character. The file must be in UTF-8, but a file generated in another encoding should be easy to convert to UTF-8.
Each line gives a character and its position in the image file, measured from the bottom-left corner (0; 0).
e 922 2621 938 2644 0
gives the lower-left coordinates (x=922 and y=2621) and upper-right coordinates (x=938 and y=2644) of the box that contains an e character.
The last number on each line is the page number (0-based): here it is the first page (0).
More information can be found here.
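As a minimal sketch of the format described above, the following Python reads one line of a .box file and, because most image tools put the origin at the top left, also converts the box to top-left coordinates. The names Box, parse_box_line and to_top_left are made up for this illustration, and the image height of 3000 pixels is an arbitrary assumption, not a value taken from my files.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse 'char left bottom right top page' into a Box."""
    # Split from the right so the five numbers are always separated
    # from whatever comes first on the line.
    char, *numbers = line.rstrip("\n").rsplit(" ", 5)
    return Box(char, *map(int, numbers))


def to_top_left(box, image_height):
    """Return (left, upper, right, lower) with the origin at the top left."""
    return (box.left, image_height - box.top,
            box.right, image_height - box.bottom)


box = parse_box_line("e 922 2621 938 2644 0")
print(box)
print(to_top_left(box, image_height=3000))  # 3000 is a made-up height
```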
Why not use LaTeX?
The problem is that OpenOffice only generates a PDF (image), and I don't know how to parse a PDF file to get the position of each character (or whether that is even possible).
I then remembered that LaTeX uses boxes to place each character on a page, so LaTeX "knows" where each box is. So I found the discussion mentioned above and decided to give it a try.
I read in that discussion that they used the following process, so I tried to do the same:
a) You install MikTeX and some package for it that allows latex to understand kannada language, as I know there is package called itrans that work with this. However you need to provide input for it using latin transliteration.
b) You prepare training text (using transliteration) and process it with itrans and latex. You get at this stage a .dvi file that is typeseted in kannada language and contains (in a cryptic form) all the information about the character boxes of your text. You extract this information using my perl script
dvitype <file.dvi>| perl script1.pl > file.texbox
c) You produce training image for tesseract. You can do this electronically using dvips + ghostscript
dvips -o <ps.file> <dvi.file>
gs -r300x300 -dNOPAUSE -sOutputFile=<file.tif> -sDEVICE=tiffg4 <ps.file>
or by printing dvi file on printer and then scanning it.
You run
tesseract <file.tif> <file.txt> batch.nochop makebox
You rename file.txt into file.box.
d) You produce final box file for training tesseract using my second perl script
perl correlatebox.pl file.texbox file.box > result_file.box
Unfortunately, the .pl files are missing for now… and I am not an expert with DVI files, font files.
What I tried
My idea is to output a parsable DVI file (using dvitype) to get the boxes' coordinates, and then to generate an image from the DVI.
But when I use lualatex -output-format dvi training.helveticacomp.tex, I get some errors with the font (HelveticaComp.ttf).
And when I use xelatex -no-pdf training.helveticacomp.tex, I get an xdv file that I cannot parse with the dvitype utility.
I installed the missing components on my basic TeX Live distribution: dvitype from texware and disdvi from dtl (the latter to disassemble xdv files with disdvi -x <xdv file>).
I found interesting information in:
- Which TeX programs produce dvi output?
- System fonts with LaTeX? (OS X)
- How can I get the .dvi file from Latex in addition to or instead of a pdf
I tried to follow this document to install the font I need (because lualatex complained about a missing font), but I still get some errors. I am also not sure I want to install a completely new font into my LaTeX setup, because it seems difficult, so I have not gone further with it yet.
Here is an example of my .tex file:
\documentclass[fontsize=12,a4paper,headheight=0.5cm,headsepline,parskip=half-]{scrartcl}
\KOMAoptions{BCOR=0mm,DIV=40}
\usepackage{fontspec}
\setmainfont{Helvetica-Compressed}
\begin{document}
‘ kit Contacts Carte Type un forme ç~ avant BYW: EN monde 2001 qu'on plan image ZG 23 À+ niveau femmes
\end{document}
My question
I need some leads on how to generate a correct .box file, together with its .png (or .tif) image file, using LaTeX.
If possible, I would like to avoid installing a new font.
Any help would be appreciated on how to parse the xdv or dvi file to obtain the .box file, and on how to handle my font correctly.
Any other solution that produces the same output (.box + image file) is of course welcome.
I suppose this would interest anyone who wants to train Tesseract without spending hours editing .box files.
My computer
I use a basic TeX Live installation on Mac OS X (10.10).
Best Answer
So I investigated the option of using LuaTeX's node processing callbacks. The best suited is pre_output_filter, which is called when a page is ready for output. I've created a simple package, named boxes, which consists of two files: the LaTeX package boxes.sty and the Lua module boxes.lua.
boxes.sty:
This file is simple; the important thing to note is the package options:
geometry, I hope there is some way to determine it, but I don't know how, so we must set these values by hand, with some experimenting.
And now the Lua module, boxes.lua:
The code is really simple: in the function boxes.traverse we process the list of line nodes. When we find a node with id 0, which is an hlist (here, one line of text), we increase the line count and advance the vertical position by \baselineskip. This works as long as the text is simple, without more advanced formatting that would cause vertical space bigger than \baselineskip. But for this specific purpose we may assume that only plain text without formatting is used.
We then process the child list for glyph nodes and calculate the horizontal position with the node.dimensions function:
set, sign and order are used to calculate the size of the spaces; because a space has variable width, it may be a little bit different on each line. These values are set in the parent hlist node. The w variable is the width from the beginning of the line up to the current glyph.
Then we calculate the dimensions of the character with a function:
The variables x and y are the bottom-left coordinates. Because the coordinate system in TeX begins at the top left, but in the box file format it starts at the bottom left, all vertical dimensions must be mirrored, simply by subtracting the calculated y value from the total height of the page. At the end, the calculated dimensions are scaled to the current resolution by dividing by the bp variable.
We can visualize the boxes with the tessboxes command. It needs an image in pbm format, which can be created with the pdftoppm command: