[TeX/LaTeX] Identity-H encoding: should it be avoided, and if so, how?

font-encodings, fonts

When I create a document with multiple fonts (using xelatex and xeCJK), subsets of all these fonts are included in the document.
The document’s properties shown in evince, in the Fonts tab, show the following:

  1. Most of the fonts are shown as name-style-Identity-H, e.g. LibertinusSans-Bold-Identity-H (but some don’t end in Identity-H).
  2. All of them further show either Type 1C (CID) or TrueType (CID).
  3. Below all of them there’s a line saying: Encoding: Identity-H.

(When I check the Fonts tab in a pdftex-created document I got from someone else, the fonts are all Type 1 with Encoding: Custom.
Other documents have other encodings, or also Identity-H.)

In this thread on the Adobe forums, someone writes that

To extract text when this encoding is used, the PDF also needs a “ToUnicode CMap”. You cannot see if one of these is present.

What does Identity-H really mean and how does xelatex behave concerning this?

Best Answer

In XeLaTeX, you normally do not use classical 8-bit fonts, but OpenType or TrueType fonts. These can contain more than 256 glyphs, so in PDF files they cannot easily be represented as "simple fonts" but have to be represented as "CID fonts".

For simple fonts, the encoding maps character codes between 0 and 255 to glyph names, which are then used to identify the right glyph in the font.

CID fonts replace the glyph names with CIDs (character identifiers), which are simply numbers. So for a CID font, the character code written to the PDF file is mapped via the encoding to a CID, which is then translated to the actual GID (the position of the glyph in the font file). Predefined CID encodings (they are called CMaps, but they are not the same thing as ToUnicode CMaps) can help the PDF viewer work out which character a specific glyph represents, but, as with glyph names, this mapping sometimes needs adjustments and may depend on the PDF viewer. So it is often easier to set a ToUnicode CMap explicitly: a map that associates a Unicode value with every glyph. XeTeX and LuaTeX normally do this.
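The general CID lookup chain described above can be sketched in Python as two table lookups. Both tables and every value in them are hypothetical placeholders, not entries from any real CMap or font:

```python
# Hypothetical CMap fragment: character code in the PDF -> CID.
CMAP = {0x8140: 633, 0x8141: 634}

# Hypothetical CID -> GID table for one particular embedded font file.
CID_TO_GID = {633: 12, 634: 13}


def code_to_gid(code):
    """Follow the chain: character code -> CID -> GID."""
    cid = CMAP.get(code, 0)          # unknown codes fall back to CID 0
    return CID_TO_GID.get(cid, 0)    # unknown CIDs fall back to GID 0


print(code_to_gid(0x8140))  # 12
```

Note that nothing in this chain mentions Unicode; that is why text extraction needs separate information, such as a ToUnicode CMap.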

In this case, the CID scheme can be simplified. By using the CMap/encoding Identity-H, we basically say: "The character codes we write to the PDF file are already CIDs; you don't have to remap anything." This lets us skip one step of the procedure. Additionally, we usually use fonts in which the CID is the same as the GID, so we can just write GIDs to the PDF file without having to worry about all this CID-related stuff.
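To make this concrete: with Identity-H, every character code in a text string is two bytes, high byte first, and the resulting number is used directly as the CID. A minimal Python sketch (the function name `identity_h_cids` and the sample bytes are my own invention) might look like this:

```python
from typing import List


def identity_h_cids(data: bytes) -> List[int]:
    """Split a string shown with an Identity-H font into CIDs.

    Each code is two bytes, big-endian, and the code *is* the CID.
    """
    if len(data) % 2 != 0:
        raise ValueError("Identity-H strings have an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# A made-up hex string as it might appear in a content stream: <00240048>
cids = identity_h_cids(bytes.fromhex("00240048"))
print(cids)  # [36, 72]

# Assuming the font also uses an identity CID-to-GID mapping,
# these numbers are already the glyph indices in the font file.
gids = cids
```

This is why such PDFs look like "gibberish" at the byte level: the strings hold glyph numbers of one specific font, not characters.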

So basically Identity-H is saying: "We skip the entire encoding stage and handle all encoding-related business in TeX directly." That should never be a problem, except that the PDF viewer can't map glyphs to Unicode values if you don't add a ToUnicode CMap. But XeTeX inserts such a ToUnicode CMap by default, so you are all set. (An easy test to see whether a ToUnicode CMap is present for an Identity-H font: if copy and paste mostly works, you certainly have a ToUnicode CMap.)
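For the curious, here is roughly what a viewer does with a ToUnicode CMap when you copy text. This is only a sketch: it handles the `beginbfchar` ... `endbfchar` sections and ignores `bfrange` and everything else, and the sample CMap fragment and its values are invented for illustration. The destination values are UTF-16BE, which is how a single glyph such as an "fi" ligature can map to two characters.

```python
import re

# Invented fragment of a ToUnicode CMap: <CID> <UTF-16BE text>.
TOUNICODE = """
3 beginbfchar
<0024> <0041>
<0025> <0042>
<0150> <00660069>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Collect CID -> Unicode string pairs from bfchar sections only."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def extract_text(cids, mapping):
    """Translate CIDs to text; unmapped CIDs become U+FFFD."""
    return "".join(mapping.get(cid, "\ufffd") for cid in cids)


mapping = parse_bfchar(TOUNICODE)
print(extract_text([0x24, 0x25, 0x150], mapping))  # ABfi
```

Without the map, every CID would end up as a replacement character (or worse, as arbitrary characters), which is the failure mode the copy-and-paste test detects.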