[Tex/LaTex] Converting Math Symbols from PDF into LaTeX

I am trying to extract math content from LaTeX-generated PDF files. Most symbols are extracted fine. However, some, such as \epsilon, \Updownarrow and \simeq, use non-Unicode codes, and others, such as \neq, use a combination of non-Unicode codes.

\epsilon is written using the embedded font SCCPFS+CMMI10 and code 017
\Updownarrow using the embedded font KAXSYH+CMSY10 and code 0x6d (m)
\simeq using the embedded font KAXSYH+CMSY10 and code 0x27 (')
\neq using the embedded font KAXSYH+CMSY10 and codes 0x36 (6) and 0x3d (=)

Before I begin writing a table to map from the glyph code(s) to the equivalent LaTeX I wonder if such a mapping table already exists in the reverse direction for use within LaTeX. After all, somewhere the original \epsilon, \neq etc. would be getting mapped to one or more glyph codes. The combination cases will require position information also, but that should be there too, in the reverse direction.

EDIT: I tried to look up this information in the font tables, but there are no entries in GSUB and GPOS. Is that where I should be looking? Is the information really inside the font?

EDIT: I tried looking up the mmap file in a text editor but it is mostly hex. Is there a tool for opening it?

%!PS-Adobe-3.0 Resource-CMap
%%DocumentNeededResources: ProcSet (CIDInit)
%%IncludeResource: ProcSet (CIDInit)
%%BeginResource: CMap (TeXmath-LMR-0)
%%Title: (TeXmath-LMR-0 TeXmath LMR 0)
%%Version: 1.000
%%EndComments
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (TeXmath)
/Ordering (LMR)
/Supplement 0
>> def
/CMapName /TeXmath-LMR-0 def
/CMapVersion 1.000 def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
96 beginbfchar
<00> <005C00620069006700630069007200630020>
<01> <005C006D0064006C00670062006C006B0063006900720063006C00650020>
<02> <005C0073007100750061007200650020>
<03> <005C0062006C00610063006B0073007100750061007200650020>
<04> <005C0076006100720074007200690061006E0067006C00650020>
<05> <005C0062006C00610063006B0074007200690061006E0067006C00650020>
<06> <005C0074007200690061006E0067006C00650064006F0077006E0020>
<07> <005C0062006C00610063006B0074007200690061006E0067006C00650064006F0077006E0020>
<08> <005C006C006F007A0065006E006700650020>
<09> <005C0062006C00610063006B006C006F007A0065006E006700650020>
<0A> <005C006D0064006C00670062006C006B006400690061006D006F006E00640020>

EDIT: I looked up the character for \neq and it is composed of glyphs from two different fonts, so it is unlikely that this information is in one font. Doing a grep in the texlive directory gives some hints:

% grep -rw neq * | grep -w not
texmf-dist/tex/plain/base/plain.tex:\def\neq{\not=} \let\ne=\neq
texmf-dist/tex/generic/enctex/utf8raw.tex:\mubyte \neq ^^e2^^89^^a0\endmubyte % U+2260 not equal to
texmf-dist/tex/generic/ofs/ofs-cm.tex:  \def\neq{\not=} 
texmf-dist/tex/latex/listings/lstlang3.sty:      myfont,n,nat2string,neq,ngon,norm2,normalmap,not,nu_grid,nubspline,%
texmf-dist/tex/latex/sansmath/sansmath.sty:% two lines, but it did not work well (unbold +, bold greek, bad \neq)
texmf-dist/tex/latex/base/fontmath.ltx:\def\neq{\not=} \let\ne=\neq
texmf-dist/tex/latex/unicode-math/unicode-math-xetex.sty:  \cs_gset:cpn { not= }    { \neq }
texmf-dist/tex/latex/unicode-math/unicode-math-table.tex:\UnicodeMathSymbol{"02260}{\ne                       }{\mathrel}{/ne /neq r: not equal}%
texmf-dist/tex/latex/unicode-math/unicode-math-luatex.sty:  \cs_gset:cpn { not= }    { \neq }
texmf-dist/tex/latex/breqn/cmbase.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
texmf-dist/tex/latex/breqn/mathpazo.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}
texmf-dist/tex/latex/breqn/mathptmx.sym:\DeclareFlexCompoundSymbol{\neq}{Rel}{\not{=}}

Best Answer

Let’s start with the following example:

\documentclass{article}
\newcommand*\testsqrtsign[1]{\sqrtsign{\vphantom{#1}}}
\pagestyle{empty}
\begin{document}
\[
\testsqrtsign{|}\testsqrtsign{\big|}\testsqrtsign{\Big|}\testsqrtsign{\bigg|}\testsqrtsign{\Bigg|}
\]
\end{document}

Compile the above code with pdfLaTeX and open the resulting PDF file in Adobe Acrobat Reader DC. Press Ctrl + F and type “pqrsvuut” in the Find bar. Press the Enter key or the Next button, and we find that the square root signs are highlighted as a match.

How bizarre, isn’t it?

Inspecting the PDF file further, we find that a font named “cmex10” is embedded. This simple experiment gives you a taste of how mathematical symbols are encoded in default LaTeX (and, to a certain extent, in the original TeX).

To address your question

I wonder if such a mapping table already exists in the reverse direction for use within LaTeX.

The short answer is: Yes.

Part 1: The default mathematical encodings

According to the LaTeX font encoding guide, there are 3 math font encodings by default (Section 2.6 on page 10), namely, OML, OMS and OMX. In particular, Appendix A.4 (pp. 33–34) lists 3 tables showing where exactly each math letter/symbol is encoded.

For instance,

the “Greek math italic lowercase epsilon” is encoded in OML at position '017 (octal) or "0F (hexadecimal), corresponding to the font “cmmi10” (Computer Modern Math Italic 10);
the “up down double arrow” is encoded in OMS at position '155 (octal) or "6D (hexadecimal), corresponding to the font “cmsy10” (Computer Modern Math Symbols 10);
the “integral sign in \textstyle” is encoded in OMX at position '122 (octal) or "52 (hexadecimal), corresponding to the font “cmex10” (Computer Modern Math Extension 10);
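To see these slots for yourself, here is a minimal sketch, assuming pdfLaTeX with the default Computer Modern fonts. The \slot helper is hypothetical and introduced only for this illustration; it selects a font with \usefont and typesets a glyph by its slot number with \char:

```latex
\documentclass{article}
% Hypothetical helper (not part of LaTeX):
% \slot{<encoding>}{<family>}{<shape>}{<hex slot>}
% typesets the glyph stored at the given slot of the selected font.
\newcommand*\slot[4]{{\usefont{#1}{#2}{m}{#3}\char"#4}}
\begin{document}
% OML / cmmi10, slot "0F: lowercase epsilon
OML \texttt{"0F}: \slot{OML}{cmm}{it}{0F}\par
% OMS / cmsy10, slot "6D: up down double arrow
OMS \texttt{"6D}: \slot{OMS}{cmsy}{n}{6D}\par
% OMX / cmex10, slot "52: text-style integral sign
OMX \texttt{"52}: \slot{OMX}{cmex}{n}{52}\par
\end{document}
```

The glyph codes an extraction tool reports for these fonts are the same slot numbers, which is why the tables in the encoding guide can be read in either direction.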

Part 2: The mapping from commands to slots

The code containing the mapping from commands \epsilon, \Updownarrow and \int to their corresponding slots can be found in fontdef.dtx. For instance, we find these declarations:

...
\DeclareSymbolFont{letters}     {OML}{cmm} {m}{it}
\DeclareSymbolFont{symbols}     {OMS}{cmsy}{m}{n}
\DeclareSymbolFont{largesymbols}{OMX}{cmex}{m}{n}
...
\DeclareMathSymbol{\epsilon}{\mathord}{letters}{"0F}
...
\DeclareMathDelimiter{\Updownarrow}
   {\mathrel}{symbols}{"6D}{largesymbols}{"77}
...
\DeclareMathSymbol{\intop}{\mathop}{largesymbols}{"52}
    \def\int{\intop\nolimits}
...

This is the “reverse” table you are asking for:

\epsilon is from letters, which is OML encoded and is located at "0F.
\Updownarrow, when not acting as a delimiter, is from symbols, which is OMS encoded and is located at "6D.
\intop is from largesymbols, which is OMX encoded and when used in \textstyle is located at "52.
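You can also ask LaTeX directly what a command resolves to. The following is a minimal sketch, assuming the default math setup (no font packages loaded); it writes the internal meaning of a few commands to the terminal and the .log file:

```latex
\documentclass{article}
\begin{document}
% \meaning expands to the current definition of the next token;
% \typeout writes the result to the terminal and the .log file.
\typeout{epsilon = \meaning\epsilon}
\typeout{intop = \meaning\intop}
\typeout{Updownarrow = \meaning\Updownarrow}
\typeout{neq = \meaning\neq}
\typeout{not = \meaning\not}
Check the log file.
\end{document}
```

With the default setup, \epsilon should be reported as \mathchar"10F. Read as four hexadecimal digits ("010F), this is class 0 (\mathord), family 1 (letters) and slot "0F. Likewise, \neq should show up as a macro that expands to \not followed by =, and \not as \mathchar"3236: class 3 (\mathrel), family 2 (symbols), slot "36. The = is taken from the operators font (cmr), which is why an extracted \neq consists of two glyphs from two different fonts.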

Part 3: Instructing LaTeX to load the actual font files

This part of the code can also be found in fontdef.dtx:

...
\input  {omlcmm.fd}
\input  {omscmsy.fd}
\input  {omxcmex.fd}
...

This seems to be irrelevant to your current question, though. Feel free to look at How (La)TeX makes use of font related files […] when selecting fonts? and related posts to learn more. This part is included here because…

Part 4: Other math fonts and non-standard encodings

The newtxmath package provides a complete upright Greek alphabet (\Gammaup, \alphaup, etc.). They are from lettersA, which is declared in newtxmath.sty as

...
\DeclareSymbolFont{lettersA}{U}{ntxmia}{m}{it}
...

where U stands for “Unknown”. The corresponding untxmia.fd file contains a variety of fonts: “nxlmia”, “zmnmia”, “zcochmia”, “zchmia”, “ntxstx2mia” and “ntxmia”, and their bold versions. In theory, the author can use whatever encoding they please for these fonts. For newtxmath, we see that

...
\re@DeclareMathSymbol{\Gammaup}{\mathalpha}{lettersA}{0}
...

So if you write, say, $\bm{\Gammaup}$, where \bm is provided by the bm package, you get a bold upright Greek uppercase Gamma. In Unicode, “Mathematical Bold Capital Gamma” is encoded at U+1D6AA, while in “lettersA” of newtxmath it is encoded at 0 (decimal, the first slot in the font) in both the regular and the bold fonts.

Now you see the problem: There cannot be a single mapping that converts extracted symbols to their corresponding Unicode characters.

Due to the lack of development in math font encodings (see LaTeX font encoding guide, the last 3 paragraphs at the end of Section 1.2), math fonts can have a variety of different “in-house” encodings. Besides newtxmath’s “lettersA” (U-encoded), there are amsfonts’s “AMSa” and “AMSb”, both U-encoded; there are the LMP1, LMP2 and LMP3 encodings of mtpro2 (commercial fonts); etc.

Concluding remarks

There are many math font encodings besides the standard three, and they are tied to specific fonts. The information about the mapping between input commands and their corresponding font slots can be found in the supporting LaTeX packages.

Since there are no “universally agreed” math font encodings, one cannot expect a single mapping (if one exists) from glyphs back to commands or Unicode characters to be useful in general.

If you simply want to copy-and-paste math formulas in the PDF file, then maybe give unicode-math a try:

% !TeX program = XeLaTeX or LuaLaTeX
\documentclass{article}
\usepackage{unicode-math}
\begin{document}
\[\int_0^{\pi\pm\epsilon} \sin x \, \symup{d} x = 2 \mp \delta\]
\end{document}

Kneel before the power of unicode-math, mortals!
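If you have to stay with pdfLaTeX, one commonly used alternative is to let pdfTeX attach ToUnicode maps derived from the glyph names in the fonts. This is a minimal sketch, assuming a TeX distribution that ships glyphtounicode.tex (as TeX Live does). How many math glyphs copy correctly depends on the fonts involved and is not guaranteed:

```latex
% !TeX program = pdfLaTeX
\documentclass{article}
% Load the glyph-name-to-Unicode table and switch on
% generation of ToUnicode maps for the embedded fonts.
\input{glyphtounicode}
\pdfgentounicode=1
\begin{document}
\[ \alpha \simeq \epsilon \]
\end{document}
```

Composite symbols such as \neq are still written as two separate glyphs, so this does not turn them into a single Unicode character.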

Okay, back to the default encodings: why can we search for “pqrsvuut” to find the square root signs? Well, the first four root signs are encoded in OMX at positions "70, "71, "72 and "73, respectively, while the last, “vertical” root sign is pieced together from one "76, two "75’s and one "74. Guess which characters usually sit at positions "70 through "76 ;-)
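To spell out the answer to the riddle: in ASCII (and in ordinary text fonts), positions "70 through "76 hold the letters p, q, r, s, t, u and v. With no Unicode information for cmex10, the viewer falls back on the raw codes, so the four fixed-size roots read as “pqrs” and the built-up root ("76, "75, "75, "74) reads as “vuut”. The sketch below puts a text font and cmex side by side for the same slots; the \compare helper is hypothetical and exists only for this illustration:

```latex
\documentclass{article}
% Hypothetical helper (not part of LaTeX): typeset slot #1 (hex)
% once in a text font (OT1/cmr) and once in the math extension font.
\newcommand*\compare[1]{%
  \texttt{"#1}\quad
  {\usefont{OT1}{cmr}{m}{n}\char"#1}\quad
  {\usefont{OMX}{cmex}{m}{n}\char"#1}%
  \par\vspace{3\baselineskip}}% cmex glyphs hang far below the baseline
\begin{document}
\compare{70}\compare{71}\compare{72}\compare{73}
\compare{74}\compare{75}\compare{76}
\end{document}
```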

For more information on how LaTeX handles fonts, the two main references (available at https://ctan.org/pkg/latex-base) are

Font encoding guide
Font selection guide

Related Solutions

[Tex/LaTex] Have XeLaTeX use the default Times font in Ubuntu

As far as I know, the 14 base fonts that all PDF readers should know are Type1 fonts (Times, Courier, Helvetica, Symbol and Zapf Dingbats) and they don't support arbitrary Unicode.

So, while with (pdf)latex it would be possible to avoid downloading the base fonts in a PDF document by setting the corresponding option

updmap-sys setoption pdftexDownloadBase14 false
updmap-sys setoption dvipsDownloadBase14 false

(thanks to Martin Schröder for pointing to the command; see the man page of updmap for more information, and use true instead of false to revert to the default), this makes little sense with XeLaTeX, because it would deprive it of its main feature: dealing with OpenType or TrueType fonts covering the whole Unicode character set.

Thus, if you plan to use XeLaTeX for exploiting OpenType features, let XeTeX and xdvipdfmx download the font to the PDF file.

[Tex/LaTex] Is it possible to make fonts appear heavier (darker) in the PDF output of LaTeX?

You can make the text a bit heavier by using Heiko's great pdfrender package. Just play with the LineWidth parameter.

\documentclass[paper=a4]{scrartcl}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{pdfrender,xcolor}

\begin{document}
\pdfrender{StrokeColor=black,TextRenderingMode=2,LineWidth=0.2pt}

A wonderful serenity has taken possession of my entire soul, like these sweet
mornings of spring which I enjoy with my whole heart. I am alone, and feel the
charm of existence in this spot, which was created for the bliss of souls like
mine. I am so happy, my dear friend, so absorbed in the exquisite sense of
mere tranquil existence, that I neglect my talents. I should be incapable of
drawing a single stroke at the present moment; and yet I feel that I never was
a greater artist than now. When, while the lovely valley teems with vapour
around me, and the meridian sun strikes the upper surface of the impenetrable
foliage of my trees, and but a few stray gleams steal into the inner
sanctuary

\[E = mc^2 \]


 I throw myself down among the tall grass by the trickling stream;
and, as I lie close to the earth, a thousand unknown plants are noticed by me:
when I hear the buzz of the little world among the stalks, and grow familiar
with the countless indescribable forms of the insects and flies, then I feel
the presence of the Almighty, who formed us in his own image, and the breath
of that universal love which bears and sustains us, as it floats around us in
an eternity of bliss.

\end{document}

With an extreme setting of LineWidth=1pt you get this beautiful output:

heavy text