[TeX/LaTeX] How are the glyph (character) names in PDF-files determined?


PDF-files make internal use of glyph names. For example, the name of ≈ (U+2248; TeX \approx) appearing in a PDF-file might be approxequal.

One can find such names in a TeX-generated PDF-file by

  1. compiling the TeX code with \pdfcompresslevel=0,
  2. inspecting the resulting PDF-file as a text file, and
  3. looking for lines starting with /CharSet.

(This procedure is taken from Ulrike Fischer's answer elsewhere, which provides more detail.)
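
A minimal sketch of step 1, assuming pdfLaTeX and the default Computer Modern Type 1 fonts. The file contents are only illustrative, and \pdfobjcompresslevel=0 is an addition not mentioned above, included on the assumption that it keeps the font descriptors out of object streams:

```latex
\documentclass{article}
% Write the PDF streams uncompressed so the file can be read as text.
\pdfcompresslevel=0
% Assumption: also disable object streams (pdfTeX 1.40 or later).
\pdfobjcompresslevel=0
\begin{document}
% \approx is the symbol whose glyph name we want to look up.
$a \approx b$
\end{document}
```

After running pdflatex on this file, opening the resulting PDF in a text editor and searching for /CharSet should show a list of glyph names for each embedded font, which in this setup would be expected to include /approxequal.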

Apparently the glyph names are font-dependent. So are they determined by the fonts? Do all font formats use such names? Which font formats use textual names? Do all glyphs in all PDF-files have such names?

How are the glyph names in PDF-files determined? Who determined the existing ones? What are they for? (Why doesn't PDF refer to the glyphs by number? Clearly some readers rely on the glyph names (see link to question about hyperlink detection below), so the PDF format or some readers make assumptions about these names. There must be a reason why an intermediary of names is used. Perhaps this has to do with the age of Unicode relative to PDF.) What else is there to know on this topic for a user of (La)TeX?

For me, the issue of PDF glyph names came up here:

A similar question is "How to find the proper glyph name required by \pdfglyphtounicode", but there is more ground that needs to be covered in this topic.
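
For context, the glyph name is what \pdfglyphtounicode takes as its first argument; the second is the Unicode code point in hexadecimal. A minimal sketch, assuming pdfLaTeX and the glyph name approxequal from the example above:

```latex
\documentclass{article}
% Standard glyph-name-to-Unicode table distributed with pdfTeX.
\input{glyphtounicode}
% Ask pdfTeX to write ToUnicode maps based on the glyph names.
\pdfgentounicode=1
% Explicit mapping for one glyph name; likely redundant here,
% since the table loaded above is expected to cover this name already.
\pdfglyphtounicode{approxequal}{2248}
\begin{document}
$a \approx b$
\end{document}
```

With such a mapping in place, copying the symbol from a PDF reader should yield U+2248, provided the font really does call the glyph approxequal.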

Best Answer

it's my understanding that the glyph names are determined by the font. (note use of the term "glyph"; characters and glyphs are related, but are not interchangeable. but that's another story.)

it's also my understanding that the names supplied by the font depend on the supplier of the font -- they may be "meaningful" in some way (e.g., an ascii letter, a unicode, a descriptive name, ...) or they may just be a supplier's internal code, as used to be the situation in the days of metal type (as shown in old monotype technical symbols listings).

things may change, but ... don't hold your breath.

adding to what ulrike has said, unicode also uses names as well as numbers. an important (but possibly irrelevant) point here is that, once both a name and a number are assigned, they are never changed, even should the name prove to be wrong, or just ill-advised.

a second point is that some glyphs are not necessarily named by a single unique unicode. a unicode is supposed to define meaning, not shape. "variant" glyphs (with the same meaning but different shape) may be represented by multiple unicodes, in two principal ways:

  • by using a combining diacritic, as \nvarleq is a compound of \leq (U+2264) and U+20D2, "combining long vertical line overlay"; almost no relations negated by a vertical cancellation are represented by single unicodes, and unless the basic principles of unicode assignment change, this will remain the norm.

  • by adding a defined "variation selector" (U+FE00) to designate recognized variants (i.e., those officially defined by unicode) that cannot be modified by adding a combining diacritic, such as \lvertneqq (less than but not equal to, with vertical negation of only the equals sign; U+2268, U+FE00).

unicode technical report #25, unicode support for mathematics, deals with these methods in sections 2.17 and 2.18 (pages 26 ff.).
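
as a hedged sketch of how such multi-code-point sequences could be tied back to glyph names in pdftex: \pdfglyphtounicode accepts several code points separated by spaces (the ligature entries in glyphtounicode.tex use the same form). both glyph names below are hypothetical placeholders, since the real names depend on the font, as noted above; as written, the two mappings have no effect until the names are replaced by ones the font actually uses.

```latex
\documentclass{article}
\usepackage{amssymb}   % provides \lvertneqq
\pdfgentounicode=1
% hypothetical glyph name: base character plus combining overlay
\pdfglyphtounicode{hypotheticalnvarleq}{2264 20D2}
% hypothetical glyph name: base character plus variation selector
\pdfglyphtounicode{hypotheticallvertneqq}{2268 FE00}
\begin{document}
$a \lvertneqq b$
\end{document}
```

whether a given pdf reader preserves the combining mark or the variation selector on copy and paste is a separate matter and may vary.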
