[Tex/LaTex] Identity-H encoding, should it be avoided and if so, how

font-encodings, fonts

When I create a document with multiple fonts (using xelatex and xeCJK), subsets of all these fonts are included in the document.
The document’s properties shown in evince, in the Fonts tab, show the following:

Most of the fonts are shown as name-style-Identity-H, e.g. LibertinusSans-Bold-Identity-H (but some don’t end in Identity-H).
All of them further show either Type 1C (CID) or TrueType (CID).
Below all of them there’s a line saying: Encoding: Identity-H.

(When I check the Fonts tab of a pdftex-created document I got from someone else, the fonts are all Type 1 with Encoding: Custom.
Other documents show other encodings, or Identity-H as well.)

In this thread on the Adobe forums, someone writes that

To extract text when this encoding is used, the PDF also needs a “ToUnicode CMap”. You cannot see if one of these is present.

What does Identity-H really mean and how does xelatex behave concerning this?

Best Answer

In XeLaTeX, you normally do not use classical 8-bit fonts, but OpenType or TrueType fonts. These can contain more than 256 glyphs, so in PDF files they cannot easily be represented as "simple fonts" and have to be represented as "CID fonts" instead.

For simple fonts, the encoding maps character codes between 0 and 255 to glyph names, which are then used to identify the right glyph in the font.

CID fonts replace the glyph names with CIDs, which are just numbers. For a CID font, the character code written to the PDF file is mapped via the encoding to a CID, which is then translated to the actual GID (the position of the glyph in the font file). Using a predefined CID encoding (these are called CMaps, but they are not the same thing as ToUnicode CMaps) can help the PDF viewer work out which character a particular glyph represents, but, as with glyph names, this mapping sometimes needs adjustment and may depend on the PDF viewer. So it is often easier to set a ToUnicode CMap explicitly: a map that associates a Unicode value with every glyph. This is what XeTeX and LuaTeX normally do.

In this case, the CID scheme can be simplified. By using the CMap/encoding Identity-H, we basically say: "The character codes we write to the PDF file are already CIDs, you don't have to remap anything." This lets us skip one step of the procedure. Additionally, we usually use fonts in which the CID is the same as the GID, so we can just write GIDs to the PDF file without having to worry about all this CID-related stuff.

So basically Identity-H is saying: "We skip the entire encoding stage and handle all encoding-related business in TeX directly." That should never be a problem, except that the PDF viewer can't map glyphs to Unicode values if you don't add a ToUnicode CMap. But by default, XeTeX does insert such a ToUnicode CMap, so you are all set. (An easy test to see whether a ToUnicode CMap is present for an Identity-H font: if copy and paste mostly works, you certainly have a ToUnicode CMap.)
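The copy-and-paste test can be tried on a small document. The following is only a minimal sketch: it assumes a standard XeLaTeX installation in which fontspec falls back to its default Latin Modern OpenType fonts, and the sample words are arbitrary, chosen because they contain ligatures and accented letters.

```latex
% Compile with: xelatex test.tex
\documentclass{article}
\usepackage{fontspec}   % default OpenType fonts (Latin Modern) are used here
% Optional: \setmainfont{...} with any OpenType/TrueType font you have installed.
\begin{document}
% Ligatures (ff, fi, ffi) and accented letters are the glyphs most likely
% to come out wrong when no glyph-to-Unicode mapping is available.
Copy-and-paste test: office, affliction, Schifffahrt, naïve, Ærøskøbing.
\end{document}
```

Open the resulting PDF, select the line and paste it into a text editor. If the words come back with their ligatures resolved into ordinary letters and the accents intact, a ToUnicode CMap is very likely present. If you have the poppler tools installed, `pdffonts test.pdf` should also report this in its `uni` column.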

Related Solutions

[Tex/LaTex] Proper use of cmap and mmap

The package mmap does a little more than cmap: it also works for mathematical symbols in your PDF.

So if your PDF does not use mathematics, use \usepackage{cmap}. If you then have problems with ligatures in Computer Modern, use \usepackage[resetfonts]{cmap}. With mathematical symbols, use \usepackage{mmap}. If you still have problems, use \usepackage[noTeX]{mmap}.

The differences are:

\usepackage{cmap}: accepts preloaded fonts without reloading them.
\usepackage[resetfonts]{cmap}: as the cmap README explains, this forces the reloading of preloaded fonts (Computer Modern).
\usepackage[useTeX]{mmap} and \usepackage{mmap}: do everything cmap does and also correct mathematical symbols in your PDF; they use the new -m.cmap files ("uses ascii strings for the macro-names").
\usepackage[noTeX]{mmap}: does everything cmap does and also corrects mathematical symbols in your PDF; it uses the cmap files (Unicode).

Load cmap or mmap first, then fontenc and babel.
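As a quick illustration of that loading order, here is a minimal pdfLaTeX preamble. It is only a sketch: the document class, the utf8 input encoding and the sample words are assumptions made for the example, not part of the original advice.

```latex
% Compile with pdflatex.
\documentclass{article}
\usepackage{cmap}             % loaded first, before fontenc, as recommended above
\usepackage[utf8]{inputenc}   % assumption: the source file is saved as UTF-8
\usepackage[T1]{fontenc}
\usepackage[ngerman]{babel}
\begin{document}
Schifffahrt, Größe, Übung.
\end{document}
```

Replacing the cmap line with \usepackage{mmap} keeps the same order and adds the handling of mathematical symbols.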

The documentation of fixltx2e only says "load in the preamble". I had no problems loading it after fontenc, babel and the fonts used.

To do your own experiments, use the following MWE:

\listfiles                      % shows used files
\documentclass[12pt]{scrartcl}
%\usepackage{cmap}              % pure T1 fonts 
%\usepackage[resetfonts]{cmap}  % pure T1 fonts, reset CM
%\usepackage{mmap}              % cmap + mathematics (ASCII)
%\usepackage[noTeX]{mmap}       % cmap + mathematics (Unicode)

 \usepackage[latin9]{inputenc}  % or utf8
 \usepackage[T1]{fontenc}       % font encoding
%\usepackage[T3,T1]{fontenc}    % T3 for package tipa
%\usepackage{tipa}              % Phonetic alphabet
 \usepackage[ngerman]{babel}    % new German orthography

%\usepackage{lmodern}           % Latin Modern
%\usepackage{tgpagella}         % has no virtual fonts
%\usepackage[osf]{mathpazo}     % old-style figures okay

%\usepackage{libertine}         % Libertine Legacy (with virtual fonts)
 \usepackage[osf]{libertine}    % with old-style (lowercase) figures

\newcommand*{\III}{\libertineGlyph{Threeroman}}
\newcommand*{\IV}{\libertineGlyph{Fourroman}}


\begin{document}

Römische Zahlen: \III, \IV.

\textsc{Ligaturen}: auffliegen auffinden finden Auflage Schifffahrt.

\textsc{Korrekt}: auf\/fliegen auf\/finden finden Auf\/lage Schiff\/fahrt.

Ziffern: 0123456789.

Donau Donaudampfschiff Donaudampfschifffahrt Donaudampfschifffahrtskapitän 
Donaudampfschifffahrtskapitän 
Donaudampfschifffahrtskapitän Donaudampfschifffahrtskapitän 
Donaudampfschifffahrtskapitän Donaudampfschifffahrtskapitän

%\textipa{[\!b] [\:r] [\;B]}

\end{document}

Add or remove the comment signs to test cmap and mmap with or without fontenc and with different fonts.

BTW: "Donaudampfschifffahrtskapitän" is a German word that is good for testing hyphenation.

[Tex/LaTex] How to avoid Type3 fonts when submitting to ManuscriptCentral

I had the same problem. I replaced a \mathbbm{1} command with \mathds{1} and that resolved the issue: no more Type 3 font error in Manuscript Central.
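For reference, \mathbbm comes from the bbm package and \mathds from the dsfont package. The swap described above might look like the following minimal sketch. The formula is just an illustration, and whether bbm actually ends up as a Type 3 bitmap font depends on which font files your TeX installation provides.

```latex
\documentclass{article}
% \usepackage{bbm}    % provides \mathbbm (the command that was replaced)
\usepackage{dsfont}   % provides \mathds
\begin{document}
% Before: $\mathbbm{1}_{A}(x)$
The indicator function of a set $A$ is written $\mathds{1}_{A}(x)$.
\end{document}
```

After recompiling, check the font list of the PDF again (for example in the Fonts tab of your viewer) to confirm that no Type 3 entry remains.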
