[Tex/LaTex] How does XeTeX typeset (math) symbols which are input as unicode

fonts symbols unicode xetex

I'm considering moving to Xe(La)TeX for the main reason that it allows me to use unicode in my LaTeX code, making said code easier to read, especially the math.

But I'm confused about how XeTeX typesets symbols based on my input. I see three possibilities:

  1. Unicode chars are made active and XeTeX outputs the TeX symbols we know and trust.
  2. Unicode chars are piped (directly) through to the final (pdf, ps, dvi) document.
  3. A combination of the two.

With one of the advertised points of XeTeX being direct access to system UTF-8 fonts, I'm guessing (2) has a lot to do with it.

Beautiful Typesetting

But TeX has always been about beautiful typesetting, and a lot of effort has gone into making symbols and their spacing look good. Do we still get the same benefits with unicode output? (Do the symbols look the same? Nicer? Worse?) I believe some symbols do not come directly from a font, but have been painstakingly crafted in TeX itself.

Specifics

There are specific math constructs that get special treatment in TeX but also have unicode symbols. For example, for \cap and \bigcap there are respectively ∩ and ⋂. Do they both behave accordingly? What about √? Or are there packages that implement this sort of thing?

Are most unicode math symbols interpreted correctly with regard to math spacing? (\mathbin, \mathrel, \mathop, \mathopen, \mathclose)

Do math delimiters derived from unicode ⦃⦄ ⦅⦆ scale vertically as they should?

Are Combining Diacritical Marks handled appropriately?

Portability

Will the output be different when the code is compiled on different systems? Will my generated pdf/ps/dvi look different when viewed on different systems? Or are all relevant fonts automatically included?

unicode-math

Finally, what role does unicode-math play in this story?

Best Answer

XeTeX introduced new primitives such as \Umathcode, which is the Unicode analog of \mathcode. (Up to version 0.9998 it was called \XeTeXmathcode; it was renamed for compatibility with LuaTeX.)

What does \mathcode do in traditional TeX? A declaration such as

\mathcode`+="202B

tells TeX that a + in math mode should be treated as a binary operation symbol (leftmost hexadecimal digit "2), taken from font family "0 (second digit), slot "2B in the corresponding font. In the same vein, one can say something like

\Umathcode`∑="1 "1 "2211

or even

\Umathcode`∑="1 "1 `∑

The primitive \Umathcode has the syntax

\Umathcode<Unicode point> = <math type> <family> <slot>

After the (optional) =, three numbers should be given, because the information no longer fits the four-hexadecimal-digit format that TeX uses for \mathcode. Internally it is still packed into a single number (in this case decimal 18883089, hexadecimal "1202211), but the translation from the packed number to the explicit type-family-slot form is not straightforward.

This will probably be accompanied by a similar declaration
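A minimal sketch for checking the packed value yourself, assuming a XeLaTeX run with UTF-8 input. XeTeX also provides \Umathcodenum, which reads a math code back as a single bit-packed number; the family number "1 is simply the value used above, not something a real font setup guarantees. The value is written to the terminal and the log, and for this assignment it should match the number quoted above.

```latex
% Compile with xelatex. Illustrative only: family "1 is the value used
% in the text above and does not correspond to a loaded Unicode math font.
\documentclass{article}
\begin{document}
% class "1 (large operator), family "1, slot U+2211
\Umathcode`∑="1 "1 "2211
% \Umathcodenum returns the same assignment as one packed integer
\typeout{Packed math code of U+2211: \the\Umathcodenum`∑}
Nothing to typeset here; see the log.
\end{document}
```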

\Umathchardef\sum="1 "1 "2211

so that typing $∑$ or $\sum$ will give the same result.

The unicode-math package loads a huge list of symbols and performs assignments similar to the one for ∑. The number corresponding to <family> will be different, because it depends on many aspects which can't be covered in a short answer.

Actually unicode-math does much more than this, because it sets things up so that commands such as \mathbf or \mathrm give the desired result.
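As a rough illustration of what this looks like in practice, the following sketch assumes XeLaTeX with the unicode-math package and its default math font (Latin Modern Math) installed. Literal Unicode input and the corresponding control sequences are set side by side so the output can be compared by eye.

```latex
% Compile with xelatex; relies on unicode-math's default math font.
\documentclass{article}
\usepackage{unicode-math}
\begin{document}
% literal Unicode input on the left, traditional commands on the right
\[ ∑_{i=1}^{n} i \qquad \sum_{i=1}^{n} i \]
\[ A ∩ B \qquad A \cap B \]
\[ ⋂_{k} A_k \qquad \bigcap_{k} A_k \]
% alphabet commands keep working after unicode-math's setup
\[ \mathbf{v} + \mathrm{d}x \]
\end{document}
```

If the two columns differ on your system, the font setup is the first thing to check.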

There are other primitives corresponding to the traditional ones: \Umathchar for using a directly specified character, \Udelimiter for setting delimiters with normal and large variants, \Umathaccent for math accents, and finally \Uradical for defining root symbols. Run texdoc xetex to open “The XeTeX reference guide” by Will Robertson and Khaled Hosny.
