[TeX/LaTeX] OS X: umlauts in UTF8-NFD yield “Package inputenc Error: Unicode char \u8:̈ not set up for use with LaTeX”

Tags: input-encodings, unicode

After switching to OS X, one of the first things I had to learn the hard way is that many non-ASCII characters, such as the German ü, can be encoded in (at least) two different forms in UTF-8 (see the byte dump right after the list):

  • U+00FC (LATIN SMALL LETTER U WITH DIAERESIS): Normalization Form C (NFC)
  • U+0075 U+0308 (LATIN SMALL LETTER U followed by COMBINING DIAERESIS): Normalization Form D (NFD)
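
As a quick byte-level illustration, a minimal sketch using only printf and xxd (the same dump tool used in the HFS+ example further below):

# NFC: the precomposed character U+00FC is the two UTF-8 bytes c3 bc
printf '\xc3\xbc\n' | xxd
# NFD: a plain "u" (75) followed by the combining diaeresis U+0308 (cc 88)
printf 'u\xcc\x88\n' | xxd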

(The gory details are all described here.)

Basically, all operating systems and applications today use NFC only, with the exception of Mac OS X, where some applications (e.g., OpenOffice) and the HFS+ file system use NFD. The result is that if you copy & paste some text from such an application (e.g., the output of the ls command) into your LaTeX document, you silently end up with NFD-encoded text, even though on screen everything looks fine:

\documentclass{article}
\usepackage[utf8]{inputenc} % comment out for lualatex/xelatex
\usepackage[T1]{fontenc}    % comment out for lualatex/xelatex

\begin{document}
äöüÄÖÜß
\end{document}

However, when compiling with pdflatex:

! Package inputenc Error: Unicode char \u8:̈ not set up for use with LaTeX.

An often-given answer to Unicode problems is "use lualatex/xelatex". However, that does not seem to help here either: when compiling with lualatex/xelatex, the output does not contain the umlauts:

[screenshot: lualatex/xelatex output with the umlauts missing]

Question: The inputenc package with the [utf8] option is apparently not able to handle NFD. Is it possible to extend it so that the above compiles?
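
One way to check whether a file really contains such NFD sequences is to compare it against an NFC-normalized copy of itself; a minimal sketch, assuming uconv from ICU is installed (the tool also used in the answer below) and using file.tex as a placeholder name:

# Any difference between the file and its NFC-normalized copy means the
# file contains decomposed (non-NFC) sequences.
if ! cmp -s file.tex <(uconv -f utf-8 -t utf-8 -x nfc file.tex); then
  echo "file.tex contains text that is not in NFC (probably NFD)"
fi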


WARNING

Note that the MWE, if copied & pasted from here into a new document, actually does compile. Apparently either my browser or the SE site transparently transforms NFD to NFC. (For Safari and Chrome that seems to be the case indeed; I have also tried Firefox without success.) I have yet to figure out how to provide some piece of text in NFD here.
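
To reproduce the problem locally, a minimal sketch for generating NFD text yourself, assuming uconv from ICU is installed (see the answer below) and using nfd-sample.txt as an illustrative file name:

# Convert the sample text to NFD, save it, and dump the bytes so you can
# verify that combining characters (cc 88 etc.) are really present.
uconv -f utf-8 -t utf-8 -x nfd <<<"äöüÄÖÜß" | tee nfd-sample.txt | xxd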


Excursus: A Bit of Extra Background on HFS+

I first stumbled over this issue when trying to put the output of an ls command into my LaTeX document: The source of many, many problems in OS X is that the HFS+ file system uses (for some totally weird reasons) NFD. Even worse: HFS+ transparently transforms all NFC characters it gets as input into NFD internally. Practically, this means that the filenames you get out are different from those you put in: If you create a file ü (the keyboard delivers NFC) and then list the directory (the file system delivers NFD), the name looks the same, but in fact is different. A short illustration (executed in an empty directory):

$ echo ü; echo ü | xxd; touch ü; ls; ls | xxd
ü
0000000: c3bc 0a                                  ...
ü
0000000: 75cc 880a                                u...

This is the reason so many tools (unison, svn, git, …) and bash's tab completion choke on OS X on filenames containing umlauts, and why you cannot use the output of ls directly in your LaTeX document.
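
A hedged workaround for that last point, assuming uconv from ICU is installed (see the answer below): normalize the listing on the way out before pasting it.

# Convert the NFD filenames delivered by HFS+ back to NFC
ls | uconv -f utf-8 -t utf-8 -x nfc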

Best Answer

(see possible solutions at the end.)

A survey of NFC and NFD UTF-8 forms in XeLaTeX input

xelatex handles the NFD form almost out of the box. You will need to load the xltxtra package, which you probably always want to load when using XeLaTeX anyway.

Here's an example bash-script to create a test document (mkutest.sh):

#! /bin/bash
(
  TEXT="åäöüÅÄÖÜß"
  cat <<'EOF'
\documentclass{article}
\usepackage{xltxtra}
\begin{document}
EOF
  echo
  uconv -f utf-8 -t utf-8 -x nfc <<<"UTF-8-NFC: $TEXT"
  echo
  uconv -f utf-8 -t utf-8 -x nfd <<<"UTF-8-NFD: $TEXT"
  echo
  cat <<'EOF'
\end{document}
EOF
) > utest.tex

This script uses uconv (from ICU; see note 1 below) to create the two representations (NFC and NFD) of the same text and adds the XeLaTeX pre-/post-amble. This script should be "safe" to copy from the web page, since it uses the converter and the text input to it can be in any UTF-8 form. (See note 2 below for a version that does not depend on uconv.)

The created file looks like this (utest.tex):

\documentclass{article}
\usepackage{xltxtra}
\begin{document}

UTF-8-NFC: åäöüÅÄÖÜß

UTF-8-NFD: åäöüÅÄÖÜß

\end{document}

(Note: This may not yield the desired file if just copied from the web. See the warning in the question.)
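
To reproduce the result, the assumed invocation (file names as in the script above):

bash mkutest.sh     # writes utest.tex
xelatex utest.tex   # produces utest.pdf with the two test lines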

The result of running this through XeLaTeX is a PDF with the text:

[screenshot: the NFC and NFD test lines as rendered by XeLaTeX]

where the two lines do not look exactly the same (even apart from the label). The accents in the first line look OK, but the accents on the capital letters in the second line are badly misaligned.

So, although XeLaTeX can handle NFD form, it may not do it properly...

If \usepackage{xltxtra} is omitted the PDF looks like:

[screenshot: example without the xltxtra package]

which corroborates the example use of XeLaTeX in the question. Furthermore, note that nothing at all shows up in the first row and the ß is missing from the second row. This is because the loaded fonts don't have the glyphs to render this text. The xltxtra package loads fontspec, which by default sets up the "Latin Modern" font. Without this, only legacy fonts are loaded, which do not play nicely with Unicode text.

I have tested with different fonts (system fonts loaded with the fontspec command \setmainfont{<name of font>}). The results have been somewhat diverse. For all fonts that have the needed glyphs, the first line looks correct. The second line, however, can come out in different forms: for example, with the accents after the base letters, as if they were non-combining; or with missing-glyph boxes after the base letters...

As Khaled noted, XeTeX can normalize its input to NFC. Adding \XeTeXinputnormalization=1 to the preamble, before any non-NFC text is read, and still using \usepackage{xltxtra} and/or other means to set up proper fonts, the output is:

[screenshot: example with automatic NFC normalization]

This time the two lines do look exactly the same (apart from the label).


What to do?

If using XeTeX, \XeTeXinputnormalization=1 is definitely a solution. Just remember that you have to properly set up the fonts.
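
A minimal sketch in the style of mkutest.sh above (the file name ntest.tex is just an example; the font setup is again left to xltxtra):

cat > ntest.tex <<'EOF'
\documentclass{article}
\XeTeXinputnormalization=1   % normalize all input to NFC before it is read
\usepackage{xltxtra}         % loads fontspec and sets up a suitable font
\begin{document}
Paste NFD text (e.g. the output of ls on HFS+) here.
\end{document}
EOF
xelatex ntest.tex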

The other way to go, which works with all(?) programs that support UTF-8 NFC text input, is to convert the input files beforehand.

To massage the files into NFC form one can, for example, use uconv (from ICU; see note 1 below) as I did in the MWE-generator above.

$ uconv -o outfile.tex -f utf-8 -t utf-8 -x nfc infile.tex

(This works with UTF-16 encoding -- and others -- too. Just change the from (-f) and to (-t) options appropriately.)

Disclaimer: Use this command at your own risk. Be sure to keep the original file until you can verify the result.

This should probably be safe to run on any (7-bit) ASCII or UTF-8 encoded .tex file. If the file is already in NFC, the conversion should not change anything, since it is idempotent. Files containing only 7-bit ASCII are already in NFC, since 7-bit ASCII is a subset of UTF-8 and contains no combining characters that could make the text non-NFC.
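
For a whole project, a hedged sketch along the same lines (it keeps a backup of every file, as the disclaimer above advises; adjust the find pattern to your layout):

find . -name '*.tex' | while IFS= read -r f; do
  cp "$f" "$f.orig"                                # keep the original around
  uconv -f utf-8 -t utf-8 -x nfc "$f.orig" > "$f"  # write the NFC version back
done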


Notes

  1. The uconv utility from ICU is in the package libicu-dev on my Ubuntu 12.04 64-bit.
    (I think it is among the examples for the ICU4C library, but I could not find any info about it from a quick search on the homepage. I'm a bit confused...)

  2. As requested by David in his comment I have made a version of the MWE-generator that does not depend on uconv.

    #!/bin/bash
    (
      echo '\documentclass{article}'
      echo '\usepackage{xltxtra}'
      echo '\begin{document}'
      echo
      echo -e 'UTF-8-NFC: \xc3\xa5\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x85\xc3\x84\xc3\x96\xc3\x9c\xc3\x9f'
      echo
      echo -e 'UTF-8-NFD: \x61\xcc\x8a\x61\xcc\x88\x6f\xcc\x88\x75\xcc\x88\x41\xcc\x8a\x41\xcc\x88\x4f\xcc\x88\x55\xcc\x88\xc3\x9f'
      echo
      echo '\end{document}'
    ) > utest.tex
    

    This version only depends on the fact that echo -e interprets \xHH escapes (and that echo without -e does not).
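
    A quick illustration of that dependency (a minimal sketch using xxd to show the bytes):

    echo -e '\xc3\xbc' | xxd   # with -e the escapes become real bytes: c3 bc 0a
    echo '\xc3\xbc' | xxd      # without -e the backslash sequences stay literal text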

    I kept the other version (above, in the main text) since it allows for easy changes in the sample text.

    For the interested, the hex escapes are generated by uconv -x '[:Cc:]>; ::nfc;' <<<"$TEXT" | hexdump -v -e '/1 "%02x "' | sed -e 's/[[:xdigit:]][[:xdigit:]]/\\x\0/g; s/ //g' for NFC, and similarly for NFD.