[TeX/LaTeX] How does output font encoding work in XeLaTeX/LuaLaTeX?

fonts, fontspec, luatex, xetex

In The LaTeX Companion (Addison-Wesley 2004), Mittelbach & Goossens dedicate section 7.11.4 to output encodings: the mapping from LICR, the internal representation of characters and symbols inside a LaTeX engine, to glyphs or combinations of glyphs available in a font file.

Since the book was published, a new generation of engines, XeTeX and LuaTeX, has come to the foreground. Unlike the old engines, the new ones make it easy – thanks to the fontspec package – to use any font installed on one's computer. Moreover, unlike the fontenc package, which The LaTeX Companion described as the cornerstone of the output encoding mechanism, fontspec requires no encoding (e.g. OT1, T1) to be specified.

The new engines, unlike the traditional ones, read UTF-8-encoded input files, which makes specifying an input encoding unnecessary. As far as I know, however, the internal representation of the input inside the engines hasn't changed, while the number of available fonts has increased dramatically. So it would seem that the problem of specifying an output encoding remains as relevant to the new engines as it was (and is) to the traditional ones.

How does the process of output encoding, of mapping LICR to glyphs, work in the new engines?

Best Answer

In many ways, font encodings in XeTeX and LuaTeX work the same as in pdfTeX, etc. One loads a font and has to know how the glyphs are encoded so that the input can be mapped to the correct output.

The big difference is that both XeTeX and LuaTeX can load OpenType system fonts (.otf files). Unlike 'traditional' TeX fonts, which come in a range of encodings and have at most 256 slots available per font, OpenType fonts are laid out in Unicode. As both of these engines also use Unicode as their standard input encoding, this means there is a direct mapping from input to output and no manipulation is needed.
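As a minimal sketch of that direct mapping, the document below loads an OpenType font through fontspec and types accented characters straight into the source. The font file name is an assumption (TeX Gyre Pagella as shipped with a typical TeX Live); any installed OpenType font should serve equally well.

```latex
% Compile with xelatex or lualatex.
\documentclass{article}
\usepackage{fontspec}
% Assumption: texgyrepagella-regular.otf can be found by the engine;
% substitute any OpenType font available on your system.
\setmainfont{texgyrepagella-regular.otf}
\begin{document}
% UTF-8 input reaches the Unicode-encoded font directly:
% no inputenc and no fontenc are loaded.
Grüße, café, naïve, Łódź
\end{document}
```

No encoding name appears anywhere in the preamble, because input and font share the same Unicode layout.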

If one wishes to use a non-Unicode font with XeTeX or LuaTeX, the same approaches as for pdfTeX are required: pick the correct encoding and set up appropriate mechanisms. For LaTeX, that would mean using fontenc. Certainly it is possible to load a 'classical' .tfm font with XeTeX or LuaTeX and to get the 'expected' glyphs out.
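For illustration, here is a sketch of the classical route under a Unicode engine. It assumes the lmodern package is installed and uses it only as a convenient source of T1-encoded fonts; the accents are written as LICR commands so that fontenc's T1 definitions pick the glyph slots.

```latex
% Compile with xelatex or lualatex.
\documentclass{article}
\usepackage[T1]{fontenc} % make the 8-bit T1 encoding the document default
\usepackage{lmodern}     % assumption: Latin Modern in its classical .tfm form
\begin{document}
% Each accent command is resolved through the T1 encoding definitions,
% exactly as it would be under pdfLaTeX.
caf\'e, na\"{\i}ve, \L\'od\'z
\end{document}
```

Typing the accented characters directly would depend on their Unicode code points happening to coincide with the T1 slots, which holds for only part of the encoding.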


However, there are some caveats. The key one is that hyphenation patterns are tied to the font encoding, and 'classical' TeX engines, XeTeX included, can only read them when making a format. This means that when making a format one needs to know something about the font encodings that will be used. As XeTeX is a Unicode engine, it loads the hyphenation patterns in a Unicode encoding. This requires that the fonts used are also Unicode-encoded if hyphenation is to be correct for all languages. The number of code points this affects is small (Unicode and T1 largely overlap), but it is an issue. For that reason, the LaTeX team relatively recently changed the default font encoding in XeTeX (and LuaTeX) from OT1 to Unicode: this may catch out the unwary trying low-level font manipulation.
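One way to see which default a given installation uses is to print the kernel's encoding macros. This is only a quick check, not part of any package interface; in current LaTeX kernels the Unicode encoding is named TU, whereas pdfLaTeX still reports OT1 unless fontenc has been loaded.

```latex
% Compile with xelatex, lualatex and pdflatex and compare the results.
\documentclass{article}
\begin{document}
\makeatletter
% \encodingdefault holds the document default encoding;
% \f@encoding holds the encoding of the font currently selected.
Default encoding: \texttt{\encodingdefault};
current encoding: \texttt{\f@encoding}.
\makeatother
\end{document}
```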

It is worth noting that LuaTeX is a slightly different case to XeTeX as it can load hyphenation patterns at run time, and as it allows re-encoding of the input. As such, LuaTeX is in some ways more able to cope with non-Unicode input than XeTeX is. However, LuaTeX development is also very much focussed on a purely-Unicode pathway, and I would be wary of creating any new documents heavily using classical TeX fonts with LuaTeX as a result.
