[TeX/LaTeX] Using and interpreting pdffonts

fonts pdf

For a fairly large (> 100 pages) document that I am writing, I have run pdffonts to check whether the fonts are suitably embedded. The output is as follows:

C:\>pdffonts main.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
PEUMGT+Utopia-Regular                Type 1C           yes yes no      10  0
QIAYNS+Utopia-Bold                   Type 1C           yes yes no       8  0
XUFKIZ+Utopia-Italic                 Type 1C           yes yes no      61  0
CVIUTI+Fourier-Math-Letters-Italic   Type 1C           yes yes no     270  0
YJVFRW+Fourier-Math-Symbols          Type 1C           yes yes yes    282  0
LPRTGE+Fourier-Math-Extension        Type 1C           yes yes yes    332  0
UYVFMY+Fourier-Math-Letters-Bold-Italic Type 1C           yes yes no     592  0

I have been using the fourier package (\usepackage{fourier}) so as to have the Fourier fonts, which I like a lot both for math and for regular text. I see from the font table that I also have some Utopia fonts, which come from Adobe, as mentioned in the fourier package documentation. I have three questions:

  1. I would like to know what the "random" letters before the font name in the table mean (e.g. on line one we have PEUMGT).

  2. I would like to learn how to interpret the font output table better. In the second-to-last column, the final row shows the number 592. What does this mean?

  3. Where can I find more information on the pdffonts command?

Best Answer

AFAICS, questions No. 1+2 have not yet been fully answered...

1.

'I would like to know what the "random" letters before the font name in the table mean (e.g. on line one we have PEUMGT).'

  • These letters are a prefix to the original font name. They indicate that the font was embedded, not with the full set of glyphs available for this font, but only as a subset. According to the PDF spec, the prefix consists of six uppercase letters followed by a plus sign. The choice of letters is arbitrary, but different subsets of the same font within one PDF file must carry different prefixes.
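A minimal sketch of how such a prefix could be split off a font name as reported by pdffonts. It assumes the six-uppercase-letters-plus-"+" form of the subset tag; the function name split_subset_tag is made up for this illustration.

```python
import re

# Assumption: a subset tag is exactly six uppercase letters followed by "+".
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")

def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when the name has no subset prefix."""
    match = SUBSET_TAG.match(font_name)
    if match is None:
        return None, font_name
    return match.group(1), match.group(2)

# One subsetted name from the table above and one name without a prefix.
for name in ["PEUMGT+Utopia-Regular", "Utopia-Regular"]:
    print(split_subset_tag(name))
```

A name that comes back with a tag of None was either embedded in full or not embedded at all; the emb and sub columns tell these two cases apart.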

2.

'I would like to learn how to interpret the font output table better. In the second-to-last column, the final row shows the number 592. What does this mean?'

  • As Herbert already said, PDF files contain numbered objects (each also has a "generation" sub-number, which in most cases is 0). To look up the exact PDF code for object 592, generation 0, search the PDF for the section that starts with the line 592 0 obj and ends with the line endobj. Everything in between defines this object. The object may in turn reference other objects: if you find a string such as 691 0 R, look up object 691, generation 0, in the same way as you looked up object 592.
  • And if you want to know where, and in how many places, object 592 is used in the PDF, search for all occurrences of 592 0 R...
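The two searches described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it works on the raw bytes of the file, so it only sees objects and references stored in plain form (not those packed into compressed object streams), and it would be confused by a stream whose binary data happens to contain the bytes "endobj". The function names and the sample bytes are invented for this example.

```python
import re

def find_object(pdf_bytes, num, gen=0):
    """Return the raw bytes between 'num gen obj' and the next 'endobj', or None."""
    pattern = re.compile(
        rb"(?<!\d)%d\s+%d\s+obj\b(.*?)endobj" % (num, gen), re.DOTALL
    )
    match = pattern.search(pdf_bytes)
    return match.group(1).strip() if match else None

def count_references(pdf_bytes, num, gen=0):
    """Count indirect references of the form 'num gen R'."""
    pattern = re.compile(rb"(?<!\d)%d\s+%d\s+R\b" % (num, gen))
    return len(pattern.findall(pdf_bytes))

# Hypothetical fragment standing in for the contents of a real PDF file.
sample = (
    b"1 0 obj\n<< /Font << /F1 592 0 R >> >>\nendobj\n"
    b"592 0 obj\n<< /Type /Font /FontDescriptor 691 0 R >>\nendobj\n"
)

print(find_object(sample, 592))
print(count_references(sample, 592))
```

For a real file, the bytes would come from something like open("main.pdf", "rb").read(). The (?<!\d) guard keeps a search for 592 from also matching inside a longer number such as 1592.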

Update:

3.

What do the values in the uni column mean? (Actually, not asked by the OP, but added by myself because it fits the context... :)

  • Values in the uni column indicate whether the font in question is accompanied by a /ToUnicode table (a separate object in the PDF, if present). This table provides a reverse mapping from "character codes" to Unicode characters or code points.

  • Without a correct and valid /ToUnicode table, any text extraction will most likely fail for Custom-encoded fonts and result in unreadable garbage:

    • pdftotext will not work as expected;
    • screen readers will not be able to read PDF contents aloud to users who may need it;
    • copy'n'paste will not work as expected.

    You can test this by opening any PDF in a text editor. If you find the string /ToUnicode inside, change it to /toUnicode. (That change in capitalization means the case-sensitive keyword will no longer be found.) After that, your PDF will still display identically, but text extraction will no longer work for those fonts which the disabled /ToUnicode tables were serving.
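The same experiment can be sketched in Python instead of a text editor, which avoids accidental damage to binary data. This is only an illustration: the function name disable_tounicode is invented, and the approach only reaches /ToUnicode keys stored in plain form, not keys hidden inside compressed object streams. Because the replacement has the same length as the original keyword, the byte offsets in the file stay unchanged.

```python
def disable_tounicode(pdf_bytes):
    """Rename every /ToUnicode key to /toUnicode; return (new_bytes, hit_count)."""
    hits = pdf_bytes.count(b"/ToUnicode")
    return pdf_bytes.replace(b"/ToUnicode", b"/toUnicode"), hits

# Hypothetical font dictionary standing in for the contents of a real PDF file.
sample = b"<< /Type /Font /BaseFont /YJVFRW+Fourier-Math-Symbols /ToUnicode 283 0 R >>"
patched, hits = disable_tounicode(sample)

print(hits)
print(len(patched) == len(sample))
```

To try this on a real document, read the file in binary mode, write the patched bytes to a copy under a different name, and compare what pdftotext extracts from the original and from the copy.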

    [You may now ask: how is the text then still displayed correctly inside the viewer? The reason is that the forward mapping, from character codes to the glyph shapes to be drawn for them, uses a different mechanism... see point 4.]

4.

Newer versions of pdffonts display an additional column, encoding. It looks like this:

name                       type       encoding         emb sub uni object ID
-------------------------- ---------- ---------------- --- --- --- ---------
UBYABV+CMR10               Type 1C    Builtin          yes yes no       8  0

What do the values in the encoding column mean? (Question also added by me :)

  • The 'encoding' of a font actually represents the forward mapping mentioned in point 3: from a "character code" to the glyph ID inside the font, so that the PDF renderer knows how to draw the particular glyph representing the char code. (Note that the technical term "char code" in this context is not the same as "letter" or "character". The "char code" for expressing the letter "a" in a PDF text object may be "z", or anything else.)

  • There are various mechanisms for font encodings:

    • Base encodings (for Type 1 fonts): amongst these are StandardEncoding, WinAnsiEncoding and MacRomanEncoding. These are well known to the PDF reader and need not be embedded in the PDF.
    • Custom encodings (for Type 1 fonts): these are based on a named base encoding, which is then modified by a /Differences array to give the final font encoding.
    • Identity-H, Identity-V encodings: standard encodings for CID font types (fonts with many more than the max. 256 different glyphs which a Type 1 font can contain).
    • Builtin encodings: a built-in encoding is what every font program (except Type 3 fonts) must include.

    So why do PDFs not always use the 'builtin' encodings? Because they do not always embed the full font program. Sometimes they embed only that subset of glyphs which actually occur in the PDF document.
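To make the custom-encoding mechanism from the list above more concrete, here is a minimal sketch of how a /Differences array modifies a base encoding. In such an array, an integer sets the current character code, and each glyph name that follows is assigned to the current code, which then increases by one. The base encoding excerpt and the glyph names below are invented for illustration and are not taken from any real font.

```python
def apply_differences(base_encoding, differences):
    """Return a new code-to-glyph-name mapping with the differences applied."""
    encoding = dict(base_encoding)  # leave the base encoding untouched
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item  # an integer restarts the numbering at this code
        else:
            if code is None:
                raise ValueError("glyph name appears before any character code")
            encoding[code] = item  # assign the name, then move to the next code
            code += 1
    return encoding

# Hypothetical excerpt of a base encoding (character code -> glyph name).
base = {65: "A", 66: "B", 67: "C", 97: "a"}
# Hypothetical array, written in a PDF as [66 /Beta /Gamma 97 /alpha].
custom = apply_differences(base, [66, "Beta", "Gamma", 97, "alpha"])

print(custom)
```

Codes that the array does not mention (65 in this example) keep the glyph name from the base encoding.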