[TeX/LaTeX] Converting Math Symbols from PDF into LaTeX

pdf

I am trying to extract math content from LaTeX-generated PDF files. Most symbols are extracted fine. However, some, such as \epsilon, \Updownarrow and \simeq, use non-Unicode codes, and others, such as \neq, use a combination of non-Unicode codes.

  • \epsilon is written using the embedded font SCCPFS+CMMI10 and code 017
  • \Updownarrow using the embedded font KAXSYH+CMSY10 and code 0x6d (m)
  • \simeq using the embedded font KAXSYH+CMSY10 and code 0x27 (')
  • \neq using the embedded font KAXSYH+CMSY10 and codes 0x36 (6, drawn as a slash) and 0x3d (=)

Before I begin writing a table to map from the glyph code(s) to the equivalent LaTeX, I wonder whether such a mapping table already exists in the reverse direction for use within LaTeX. After all, somewhere the original \epsilon, \neq etc. must be mapped to one or more glyph codes. The combination cases will also require position information, but that should be there too, in the reverse direction.

EDIT: I tried to look up this information in the font tables, but there are no entries in GSUB and GPOS. Is that where I should be looking? Is the information really inside the font?


EDIT: I tried looking up the mmap file in a text editor but it is mostly hex. Is there a tool for opening it?

%!PS-Adobe-3.0 Resource-CMap
%%DocumentNeededResources: ProcSet (CIDInit)
%%IncludeResource: ProcSet (CIDInit)
%%BeginResource: CMap (TeXmath-LMR-0)
%%Title: (TeXmath-LMR-0 TeXmath LMR 0)
%%Version: 1.000
%%EndComments
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (TeXmath)
/Ordering (LMR)
/Supplement 0
>> def
/CMapName /TeXmath-LMR-0 def
/CMapVersion 1.000 def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
96 beginbfchar
<00> <005C00620069006700630069007200630020>
<01> <005C006D0064006C00670062006C006B0063006900720063006C00650020>
<02> <005C0073007100750061007200650020>
<03> <005C0062006C00610063006B0073007100750061007200650020>
<04> <005C0076006100720074007200690061006E0067006C00650020>
<05> <005C0062006C00610063006B0074007200690061006E0067006C00650020>
<06> <005C0074007200690061006E0067006C00650064006F0077006E0020>
<07> <005C0062006C00610063006B0074007200690061006E0067006C00650064006F0077006E0020>
<08> <005C006C006F007A0065006E006700650020>
<09> <005C0062006C00610063006B006C006F007A0065006E006700650020>
<0A> <005C006D0064006C00670062006C006B006400690061006D006F006E00640020>
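The right-hand hex strings in the bfchar section are UTF-16BE text, so they can be read without a special viewer. A minimal sketch of a decoder follows; the function name `decode_bfchar` and the `sample` string are only illustrative, and the parser handles just the `<code> <hex>` entry form shown above, not every construct a CMap file may contain.

```python
import re

# One bfchar entry: <source code> <UTF-16BE text as hex>
ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def decode_bfchar(cmap_text):
    """Return {code: text} for the entries between beginbfchar and endbfchar."""
    table, inside = {}, False
    for line in cmap_text.splitlines():
        line = line.strip()
        if line.endswith("beginbfchar"):
            inside = True
        elif line == "endbfchar":
            inside = False
        elif inside:
            m = ENTRY.fullmatch(line)
            if m:
                # The target string is a sequence of UTF-16BE code units.
                text = bytes.fromhex(m.group(2)).decode("utf-16-be")
                table[int(m.group(1), 16)] = text
    return table

# A few lines copied from the listing above.
sample = """\
1 begincodespacerange
<00> <FF>
endcodespacerange
96 beginbfchar
<00> <005C00620069006700630069007200630020>
<02> <005C0073007100750061007200650020>
<08> <005C006C006F007A0065006E006700650020>
"""
decoded = decode_bfchar(sample)
for code, text in decoded.items():
    print(f"<{code:02X}> {text!r}")
```

Each entry decodes to a LaTeX command name followed by a space, for example code 00 to `\bigcirc `.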

EDIT: I looked up the characters for \neq and they come from two different fonts, so this information is unlikely to be in a single font. A grep in the texlive directory gives some hints:

% grep -rw neq * | grep -w not
texmf-dist/tex/plain/base/plain.tex:\def\neq{\not=} \let\ne=\neq
texmf-dist/tex/generic/enctex/utf8raw.tex:\mubyte \neq ^^e2^^89^^a0\endmubyte % U+2260 not equal to
texmf-dist/tex/generic/ofs/ofs-cm.tex:  \def\neq{\not=} 
texmf-dist/tex/latex/listings/lstlang3.sty:      myfont,n,nat2string,neq,ngon,norm2,normalmap,not,nu_grid,nubspline,%
texmf-dist/tex/latex/sansmath/sansmath.sty:% two lines, but it did not work well (unbold +, bold greek, bad \neq)
texmf-dist/tex/latex/base/fontmath.ltx:\def\neq{\not=} \let\ne=\neq
texmf-dist/tex/latex/unicode-math/unicode-math-xetex.sty:  \cs_gset:cpn { not= }    { \neq }
texmf-dist/tex/latex/unicode-math/unicode-math-table.tex:\UnicodeMathSymbol{"02260}{\ne                       }{\mathrel}{/ne /neq r: not equal}%
texmf-dist/tex/latex/unicode-math/unicode-math-luatex.sty:  \cs_gset:cpn { not= }    { \neq }
texmf-dist/tex/latex/breqn/cmbase.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
texmf-dist/tex/latex/breqn/mathpazo.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
texmf-dist/tex/latex/breqn/mathptmx.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
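The plain.tex and fontmath.ltx hits define \neq as \not=, which suggests that a combination can be folded back by looking at adjacent glyphs. The sketch below illustrates that idea only; the names `NOT`, `EQUALS`, `COMPOUND` and `fold` are hypothetical, the font of the "=" glyph (cmr10) is an assumption, and no position check is made.

```python
# Each extracted glyph is represented as a (font, code) pair.
NOT = ("cmsy10", 0x36)     # the slash-like glyph reported in the question
EQUALS = ("cmr10", 0x3D)   # assumption: "=" comes from the roman text font

# Known two-glyph combinations and the command they came from.
COMPOUND = {(NOT, EQUALS): r"\neq"}

def fold(glyphs):
    """Replace known adjacent glyph pairs with a single LaTeX command."""
    out, i = [], 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in COMPOUND:
            out.append(COMPOUND[pair])
            i += 2
        else:
            out.append(glyphs[i])   # left untouched for a later lookup step
            i += 1
    return out

print(fold([NOT, EQUALS]))
```

A real extractor would probably also compare the glyph positions before merging, since the two glyphs are overprinted rather than set side by side.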

Best Answer

Let’s start with the following example:

\documentclass{article}
\newcommand*\testsqrtsign[1]{\sqrtsign{\vphantom{#1}}}
\pagestyle{empty}
\begin{document}
\[
\testsqrtsign{|}\testsqrtsign{\big|}\testsqrtsign{\Big|}\testsqrtsign{\bigg|}\testsqrtsign{\Bigg|}
\]
\end{document}

Compile the above code via pdfLaTeX and then open the PDF file via Adobe Acrobat Reader DC. In the opened PDF file, press Ctrl + F and type “pqrsvuut” in the Find bar. Press the Enter key or the Next button, and we find that

[Image: sqrtsign]
How bizarre, isn’t it?

Inspecting the PDF file further, we find that a font named “cmex10” is embedded. This simple experiment gives you a taste of how mathematical symbols are encoded in default LaTeX (and, to a certain extent, in the original TeX).


To address your question

I wonder if such a mapping table already exists in the reverse direction for use within LaTeX.

The short answer is: Yes.

Part 1: The default mathematical encodings

According to the LaTeX font encoding guide, there are 3 math font encodings by default (Section 2.6 on page 10), namely, OML, OMS and OMX. In particular, Appendix A.4 (pp. 33–34) lists 3 tables showing where exactly each math letter/symbol is encoded.

For instance,

  • the “Greek math italic lowercase epsilon” is encoded in OML at position '017 (octal) or "0F (hexadecimal), corresponding to the font “cmmi10” (Computer Modern Math Italic 10);
  • the “up down double arrow” is encoded in OMS at position '155 (octal) or "6D (hexadecimal), corresponding to the font “cmsy10” (Computer Modern Math Symbols 10);
  • the “integral sign in \textstyle” is encoded in OMX at position '122 (octal) or "52 (hexadecimal), corresponding to the font “cmex10” (Computer Modern Math Extension 10).
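The octal and hexadecimal notations above name the same slots, and those slots are exactly the byte values an extractor sees. A small sketch to check the arithmetic (the helper name `tex_number` is only illustrative):

```python
def tex_number(s):
    """Parse a TeX-style octal ('155) or hexadecimal ("6D) constant."""
    if s.startswith("'"):
        return int(s[1:], 8)
    if s.startswith('"'):
        return int(s[1:], 16)
    return int(s)

for octal, hexa in [("'017", '"0F'), ("'155", '"6D'), ("'122", '"52')]:
    slot = tex_number(octal)
    assert slot == tex_number(hexa)
    # Read as plain text, a slot from 0x20 upward looks like an ASCII character.
    shown = chr(slot) if slot >= 0x20 else "(control character)"
    print(f"{octal} = {hexa} = {slot} -> {shown}")
```

This is why \Updownarrow shows up as “m” in the question, while the slot of \epsilon has no printable ASCII counterpart.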

Part 2: The mapping from commands to slots

The code containing the mapping from commands \epsilon, \Updownarrow and \int to their corresponding slots can be found in fontdef.dtx. For instance, we find these declarations:

...
\DeclareSymbolFont{letters}     {OML}{cmm} {m}{it}
\DeclareSymbolFont{symbols}     {OMS}{cmsy}{m}{n}
\DeclareSymbolFont{largesymbols}{OMX}{cmex}{m}{n}
...
\DeclareMathSymbol{\epsilon}{\mathord}{letters}{"0F}
...
\DeclareMathDelimiter{\Updownarrow}
   {\mathrel}{symbols}{"6D}{largesymbols}{"77}
...
\DeclareMathSymbol{\intop}{\mathop}{largesymbols}{"52}
    \def\int{\intop\nolimits}
...

This is the “reverse” table you are asking for:

  • \epsilon is from letters, which is OML encoded and is located at "0F.
  • \Updownarrow, when not acting as a delimiter, is from symbols, which is OMS encoded and is located at "6D.
  • \intop is from largesymbols, which is OMX encoded and when used in \textstyle is located at "52.
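Because the declarations are so regular, the reverse table can be generated rather than typed by hand. The following is a rough sketch, assuming the declarations keep the shape quoted above; the regular expressions and the name `reverse_table` are illustrative, and only the small-variant slot of a delimiter is recorded.

```python
import re

DECLS = r'''
\DeclareSymbolFont{letters}     {OML}{cmm} {m}{it}
\DeclareSymbolFont{symbols}     {OMS}{cmsy}{m}{n}
\DeclareSymbolFont{largesymbols}{OMX}{cmex}{m}{n}
\DeclareMathSymbol{\epsilon}{\mathord}{letters}{"0F}
\DeclareMathDelimiter{\Updownarrow}
   {\mathrel}{symbols}{"6D}{largesymbols}{"77}
\DeclareMathSymbol{\intop}{\mathop}{largesymbols}{"52}
'''

# symbol font name -> encoding
SYMFONT = re.compile(r'\\DeclareSymbolFont\{(\w+)\}\s*\{(\w+)\}')
# command, symbol font, hexadecimal slot
MATHSYM = re.compile(
    r'\\DeclareMathSymbol\{(\\\w+)\}\{\\\w+\}\{(\w+)\}\{"([0-9A-F]+)\}')
DELIM = re.compile(
    r'\\DeclareMathDelimiter\{(\\\w+)\}\s*\{\\\w+\}\{(\w+)\}\{"([0-9A-F]+)\}')

def reverse_table(source):
    """Build {(encoding, slot): command} from LaTeX symbol declarations."""
    encoding = dict(SYMFONT.findall(source))
    entries = MATHSYM.findall(source) + DELIM.findall(source)
    return {(encoding[font], int(slot, 16)): cmd for cmd, font, slot in entries}

table = reverse_table(DECLS)
for key, cmd in sorted(table.items()):
    print(key, cmd)
```

Keying the result by encoding rather than by slot alone matters, since the same slot number means different things in OML, OMS and OMX.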

Part 3: Instructing LaTeX to load the actual font files

This part of the code can also be found in fontdef.dtx:

...
\input  {omlcmm.fd}
\input  {omscmsy.fd}
\input  {omxcmex.fd}
...

but it seems to be irrelevant to your current question. Feel free to look at How (La)TeX makes use of font related files […] when selecting fonts? and related posts to learn more. This part is included here because…

Part 4: Other math fonts and non-standard encodings

The newtxmath package provides a complete upright Greek alphabet (\Gammaup, \alphaup, etc.). They are from lettersA, which is declared in newtxmath.sty as

...
\DeclareSymbolFont{lettersA}{U}{ntxmia}{m}{it}
...

where U stands for “Unknown”. The corresponding untxmia.fd file contains a variety of fonts: “nxlmia”, “zmnmia”, “zcochmia”, “zchmia”, “ntxstx2mia” and “ntxmia”, and their bold versions. In theory, the author can use whatever encoding they please for these fonts. For newtxmath, we see that

...
\re@DeclareMathSymbol{\Gammaup}{\mathalpha}{lettersA}{0}
...

So if you write, say $\bm{\Gammaup}$, where \bm is provided by the bm package, then you can get a bold upright Greek uppercase Gamma. In Unicode, “Mathematical Bold Capital Gamma” is encoded at U+1D6AA, while in “lettersA” of newtxmath, it is encoded at 0 (decimal, the first slot in the font) in both regular and bold fonts.

Now you see the problem: There cannot be a single mapping that converts extracted symbols to their corresponding Unicode characters.
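In practice this means a lookup has to be keyed by the font as well as by the code. A minimal sketch of such a table follows, filled only with the slots mentioned in this post; the names `TABLES` and `to_command` are hypothetical, and the key "ntxmia" is an assumption, since the font name that actually appears in a PDF may differ.

```python
# One sub-table per font; the same code means different things in each.
TABLES = {
    "cmmi10": {0x0F: r"\epsilon"},
    "cmsy10": {0x6D: r"\Updownarrow"},
    "cmex10": {0x52: r"\intop"},
    "ntxmia": {0x00: r"\Gammaup"},   # assumed font name, see the note above
}

def to_command(pdf_font_name, code):
    """Look up a glyph code after dropping a subset prefix such as 'KAXSYH+'."""
    base = pdf_font_name.split("+", 1)[-1].lower()
    return TABLES.get(base, {}).get(code)

print(to_command("SCCPFS+CMMI10", 0o17))   # font name taken from the question
print(to_command("KAXSYH+CMSY10", 0x6D))
print(to_command("KAXSYH+CMSY10", 0x0F))   # same code as \epsilon, other font
```

Unknown fonts or codes simply return None, which makes it easy to spot where the table still needs entries.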

Due to the lack of development in math font encodings (see LaTeX font encoding guide, the last 3 paragraphs at the end of Section 1.2), math fonts can have a variety of different “in-house” encodings. Besides newtxmath’s “lettersA” (U-encoded), there are amsfonts’ “AMSa” and “AMSb”, both U-encoded; there are the LMP1, LMP2 and LMP3 encodings of mtpro2 (commercial fonts); etc.

Concluding remarks

There are many math font encodings besides the standard 3, and they are tied to specific fonts. The information about the mapping between input commands and their corresponding font slots can be found in the supporting LaTeX packages.

Since there are no “universally agreed” math font encodings, a single mapping (if one exists) from glyphs back to commands or Unicode characters cannot be expected to work for every document.

If you simply want to copy-and-paste math formulas in the PDF file, then maybe give unicode-math a try:

% !TeX program = XeLaTeX or LuaLaTeX
\documentclass{article}
\usepackage{unicode-math}
\begin{document}
\[\int_0^{\pi\pm\epsilon} \sin x \, \symup{d} x = 2 \mp \delta\]
\end{document}

[Image: unicode-math]
Kneel before the power of unicode-math, mortals!


Okay, back to the default encodings: why can we search for “pqrsvuut” to find the square root signs? Well, the first 4 extended root signs are encoded in OMX at positions "70, "71, "72 and "73, respectively, while the last “vertical” root sign is pieced together using one "76, two "75’s and one "74. Guess which characters usually sit at positions "70 through "76 ;-)
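For anyone who would rather check than guess, a short sketch that reads those slot numbers as ASCII codes (the variable names are only illustrative):

```python
# Slots "70 through "76 read as ASCII characters.
for slot in range(0x70, 0x77):
    print(f"{slot:02X} -> {chr(slot)}")

# Four extended root signs, then the pieces of the "vertical" one:
# one "76, two "75's and one "74.
pieces = [0x70, 0x71, 0x72, 0x73, 0x76, 0x75, 0x75, 0x74]
found = "".join(chr(p) for p in pieces)
print(found)
```

The joined string is exactly the search term used in the experiment at the top.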

For more information on how LaTeX handles fonts, the two main references (available at https://ctan.org/pkg/latex-base) are

  • Font encoding guide
  • Font selection guide