I am trying to extract math content from LaTeX-generated PDF files. Most symbols are extracted fine. However, some, such as \epsilon, \Updownarrow and \simeq, use non-Unicode codes, and others, such as \neq, use a combination of non-Unicode codes.
\epsilon is written using the embedded font SCCPFS+CMMI10 and code 017
\Updownarrow using the embedded font KAXSYH+CMSY10 and code 0x6d (m)
\simeq using the embedded font KAXSYH+CMSY10 and code 0x27 (')
\neq using the embedded font KAXSYH+CMSY10 and codes 0x36 (/) and 0x3d (=)
Before I begin writing a table to map from the glyph code(s) to the equivalent LaTeX, I wonder if such a mapping table already exists in the reverse direction for use within LaTeX. After all, somewhere the original \epsilon, \neq, etc. would be getting mapped to one or more glyph codes. The combination cases will also require position information, but that should be there too, in the reverse direction.
EDIT: I tried to look up this information in the font tables, but there are no entries in GSUB and GPOS. Is that where I should be looking? Is the information really inside the font?
EDIT: I tried looking up the mmap file in a text editor but it is mostly hex. Is there a tool for opening it?
%!PS-Adobe-3.0 Resource-CMap
%%DocumentNeededResources: ProcSet (CIDInit)
%%IncludeResource: ProcSet (CIDInit)
%%BeginResource: CMap (TeXmath-LMR-0)
%%Title: (TeXmath-LMR-0 TeXmath LMR 0)
%%Version: 1.000
%%EndComments
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (TeXmath)
/Ordering (LMR)
/Supplement 0
>> def
/CMapName /TeXmath-LMR-0 def
/CMapVersion 1.000 def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
96 beginbfchar
<00> <005C00620069006700630069007200630020>
<01> <005C006D0064006C00670062006C006B0063006900720063006C00650020>
<02> <005C0073007100750061007200650020>
<03> <005C0062006C00610063006B0073007100750061007200650020>
<04> <005C0076006100720074007200690061006E0067006C00650020>
<05> <005C0062006C00610063006B0074007200690061006E0067006C00650020>
<06> <005C0074007200690061006E0067006C00650064006F0077006E0020>
<07> <005C0062006C00610063006B0074007200690061006E0067006C00650064006F0077006E0020>
<08> <005C006C006F007A0065006E006700650020>
<09> <005C0062006C00610063006B006C006F007A0065006E006700650020>
<0A> <005C006D0064006C00670062006C006B006400690061006D006F006E00640020>
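The long hex strings in the bfchar entries are not opaque: in a CMap of this kind each target is a UTF-16BE string, so every four hex digits are one character. A minimal Python sketch of decoding them follows; `CMAP_EXCERPT` and `decode_bfchar` are hypothetical names chosen for illustration, and the regular expression assumes its input contains only bfchar-style lines (the `<00> <FF>` codespacerange line would have to be skipped first).

```python
import re

# Two entries copied verbatim from the CMap excerpt above.
CMAP_EXCERPT = """
<00> <005C00620069006700630069007200630020>
<02> <005C0073007100750061007200650020>
"""

# A bfchar line pairs a one-byte source code with a UTF-16BE hex string.
BFCHAR_LINE = re.compile(r"<([0-9A-Fa-f]+)>\s+<([0-9A-Fa-f]+)>")


def decode_bfchar(text):
    """Return {source code: decoded target string} for each bfchar-style line."""
    table = {}
    for code_hex, target_hex in BFCHAR_LINE.findall(text):
        table[int(code_hex, 16)] = bytes.fromhex(target_hex).decode("utf-16-be")
    return table


for code, name in sorted(decode_bfchar(CMAP_EXCERPT).items()):
    print(f"{code:#04x} -> {name!r}")
```

Decoded this way, the entries turn out to be LaTeX command names followed by a space, so this particular file maps font slots to commands rather than to Unicode code points.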
EDIT: I looked up the character for \neq and it is composed of glyphs from two different fonts, so it is unlikely that this information is in one font. Doing a grep in the texlive directory gives some hints:
% grep -rw neq * | grep -w not
texmf-dist/tex/plain/base/plain.tex:\def\neq{\not=} \let\ne=\neq
texmf-dist/tex/generic/enctex/utf8raw.tex:\mubyte \neq ^^e2^^89^^a0\endmubyte % U+2260 not equal to
texmf-dist/tex/generic/ofs/ofs-cm.tex: \def\neq{\not=}
texmf-dist/tex/latex/listings/lstlang3.sty: myfont,n,nat2string,neq,ngon,norm2,normalmap,not,nu_grid,nubspline,%
texmf-dist/tex/latex/sansmath/sansmath.sty:% two lines, but it did not work well (unbold +, bold greek, bad \neq)
texmf-dist/tex/latex/base/fontmath.ltx:\def\neq{\not=} \let\ne=\neq
texmf-dist/tex/latex/unicode-math/unicode-math-xetex.sty: \cs_gset:cpn { not= } { \neq }
texmf-dist/tex/latex/unicode-math/unicode-math-table.tex:\UnicodeMathSymbol{"02260}{\ne }{\mathrel}{/ne /neq r: not equal}%
texmf-dist/tex/latex/unicode-math/unicode-math-luatex.sty: \cs_gset:cpn { not= } { \neq }
texmf-dist/tex/latex/breqn/cmbase.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
texmf-dist/tex/latex/breqn/mathpazo.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
texmf-dist/tex/latex/breqn/mathptmx.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
Best Answer
Let’s start with the following example:
Compile the above code with pdfLaTeX and open the PDF file in Adobe Acrobat Reader DC. Press Ctrl + F, type “pqrsvuut” in the Find bar, and press the Enter key or the Next button. We find that the search matches the square root signs.
Inspecting the PDF file further, we find that a font named “cmex10” is embedded. This simple experiment gives you a taste of how mathematical symbols are encoded in default LaTeX (and, to a certain extent, in the original TeX).
To address your question
The short answer is: Yes.
Part 1: The default mathematical encodings
According to the LaTeX font encoding guide, there are 3 math font encodings by default (Section 2.6 on page 10), namely OML, OMS and OMX. In particular, Appendix A.4 (pp. 33–34) lists 3 tables showing exactly where each math letter/symbol is encoded.
For instance,
\epsilon is encoded in OML at position '017 (octal) or "0F (hexadecimal), corresponding to the font “cmmi10” (Computer Modern Math Italic 10);
\Updownarrow is encoded in OMS at position '155 (octal) or "6D (hexadecimal), corresponding to the font “cmsy10” (Computer Modern Math Symbols 10);
\intop (in \textstyle) is encoded in OMX at position '122 (octal) or "52 (hexadecimal), corresponding to the font “cmex10” (Computer Modern Math Extension 10).
Part 2: The mapping from commands to slots
The code containing the mapping from the commands \epsilon, \Updownarrow and \int to their corresponding slots can be found in fontdef.dtx. For instance, we find these declarations:
This is the “reverse” table you are asking for:
\epsilon is from letters, which is OML encoded, and is located at "0F.
\Updownarrow, when not acting as a delimiter, is from symbols, which is OMS encoded, and is located at "6D.
\intop is from largesymbols, which is OMX encoded, and when used in \textstyle is located at "52.
Part 3: Instructing LaTeX to load the actual font files
This part of the code can also be found in fontdef.dtx:
but it seems to be irrelevant to your current question. Feel free to look at “How (La)TeX makes use of font related files […] when selecting fonts?” and related posts to learn more. This part is included here because…
Part 4: Other math fonts and non-standard encodings
The newtxmath package provides a complete upright Greek alphabet (\Gammaup, \alphaup, etc.). They are from lettersA, which is declared in newtxmath.sty as
where U stands for “Unknown”. The corresponding untxmia.fd file contains a variety of fonts: “nxlmia”, “zmnmia”, “zcochmia”, “zchmia”, “ntxstx2mia” and “ntxmia”, and their bold versions. In theory, the author can use whatever encoding he/she pleases for these fonts. For newtxmath, we see that
So if you write, say, $\bm{\Gammaup}$, where \bm is provided by the bm package, then you get a bold upright Greek uppercase Gamma. In Unicode, “Mathematical Bold Capital Gamma” is encoded at U+1D6AA, while in “lettersA” of newtxmath it is encoded at 0 (decimal, the first slot in the font) in both the regular and the bold fonts.
Now you see the problem: There cannot be a single mapping that converts extracted symbols to their corresponding Unicode characters.
Due to the lack of development in math font encodings (see the LaTeX font encoding guide, the last 3 paragraphs at the end of Section 1.2), math fonts can have a variety of different “in-house” encodings. Besides newtxmath’s “lettersA” (U-encoded), there are amsfonts’s “AMSa” and “AMSb”, both U-encoded; there are mtpro2’s (commercial fonts) LMP1, LMP2 and LMP3 encodings; etc.
Concluding remarks
There are many math font encodings besides the standard 3 on the market, and they are tied to specific fonts. The information about the mapping between input characters and their corresponding font slots can be found in the supporting LaTeX packages.
Since there are no “universally agreed” math font encodings, one cannot expect a single mapping (if it exists) from glyphs back to commands/Unicode characters to be useful.
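What does remain workable is a table keyed by font as well as by slot, since a slot number means something only relative to the font's encoding. Below is a minimal Python sketch seeded with the slots discussed above plus the \simeq slot reported in the question. The names `GLYPH_TO_LATEX`, `strip_subset_tag` and `lookup` are hypothetical, the table is nowhere near complete, and keying on the font name is an assumption that holds only while each font name implies one encoding.

```python
# Keys are (font name, slot); values are LaTeX commands.
# The octal and hexadecimal spellings of a slot are the same number:
# 0o017 == 0x0F, 0o155 == 0x6D, 0o122 == 0x52.
GLYPH_TO_LATEX = {
    ("CMMI10", 0o017): r"\epsilon",     # OML, '017 = "0F
    ("CMSY10", 0x6D): r"\Updownarrow",  # OMS, '155 = "6D
    ("CMSY10", 0x27): r"\simeq",        # OMS, slot as reported in the question
    ("CMEX10", 0x52): r"\intop",        # OMX, '122 = "52 (text style)
}


def strip_subset_tag(font_name):
    """Drop a subset prefix such as 'SCCPFS+' and normalise the case."""
    return font_name.split("+", 1)[-1].upper()


def lookup(font_name, code):
    """Return the LaTeX command for a glyph, or None if the table lacks it."""
    return GLYPH_TO_LATEX.get((strip_subset_tag(font_name), code))


print(lookup("SCCPFS+CMMI10", 0x0F))
print(lookup("KAXSYH+CMSY10", 0x6D))
```

Compound symbols such as \neq would need a second pass that recognises a sequence of glyphs (and their positions), which this sketch does not attempt.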
If you simply want to copy-and-paste math formulas from the PDF file, then maybe give unicode-math a try:
Okay, back to the default encodings: why can we search for “pqrsvuut” to find the square root signs? Well, the first 4 extended root signs are encoded in OMX at positions "70, "71, "72 and "73, respectively, while the last “vertical” root sign is pieced together using one "76, two "75’s and one "74. Guess what is usually at positions "70 through "76 ;-)
For more information on how LaTeX handles fonts, see the two main references, both available at https://ctan.org/pkg/latex-base.