[Tex/LaTex] OS X: umlauts in UTF8-NFD yield “Package inputenc Error: Unicode char \u8:̈ not set up for use with LaTeX”

input-encodings unicode

After switching to OS X, one of the first things I had to learn the hard way is that many non-ASCII characters, such as the German ü, can be encoded in (at least) two different forms in UTF-8:

U+00FC (LATIN SMALL LETTER U WITH DIAERESIS): Normalization Form C (NFC)
U+0075 U+0308 (LATIN SMALL LETTER U followed by COMBINING DIAERESIS): Normalization Form D (NFD)

(The gory details are all described here)
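To make the difference between the two forms visible, here is a minimal Python sketch. It is only an illustration using the standard `unicodedata` module; the helper name `describe` is made up for this example.

```python
import unicodedata

def describe(text):
    """Return (code points, UTF-8 bytes as hex) for a string."""
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    utf8_hex = text.encode("utf-8").hex()
    return code_points, utf8_hex

u_nfc = unicodedata.normalize("NFC", "\u00fc")  # precomposed form
u_nfd = unicodedata.normalize("NFD", "\u00fc")  # base letter + combining mark

print(describe(u_nfc))  # ('U+00FC', 'c3bc')
print(describe(u_nfd))  # ('U+0075 U+0308', '75cc88')
print(u_nfc == u_nfd)   # False: same appearance, different code points
```

The two strings render identically but compare unequal, which is exactly why the problem is so hard to spot in an editor.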

Basically, all operating systems and applications today use NFC only, with the exception of Mac OS X, where some components (e.g., OpenOffice or the HFS+ file system) use NFD. The result is that if you copy & paste some text from such a source (e.g., the output of the ls command) into your LaTeX document, everything looks fine.

\documentclass{article}
\usepackage[utf8]{inputenc} % comment out for lualatex/xelatex
\usepackage[T1]{fontenc}    % comment out for lualatex/xelatex

\begin{document}
äöüÄÖÜß
\end{document}

However, when compiling with pdflatex:

! Package inputenc Error: Unicode char \u8:̈ not set up for use with LaTeX.

An often-given answer to Unicode problems is "use lualatex/xelatex". However, that does not seem to help here either. When compiling with lualatex/xelatex, the output does not contain the umlauts:

enter image description here

Question: The inputenc package with [utf8] is apparently not able to handle NFD. Is it possible to extend it so that the above does compile?

WARNING

Note that the MWE, if copied & pasted from here into a new document, actually does compile. Apparently either my browser or the SE site transparently transforms NFD to NFC. (For Safari and Chrome that indeed seems to be the case; I have also tried Firefox, without success.) I have yet to figure out how to provide some piece of text in NFD here.
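Because copy & paste may silently change the form, it can help to check what a file really contains. The following Python sketch reports every combining mark in a piece of text; it is only a diagnostic aid, and the function name `find_combining_marks` and the sample text are made up for illustration.

```python
import unicodedata

def find_combining_marks(text):
    """Return (line number, column, character name) for every combining mark."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # General categories Mn, Mc and Me are the Unicode "mark" categories.
            if unicodedata.category(ch).startswith("M"):
                hits.append((line_no, col, unicodedata.name(ch, "UNKNOWN")))
    return hits

# Line 2 holds an NFD u-umlaut followed by an NFC one.
sample = "\\begin{document}\nu\u0308 and \u00fc\n\\end{document}\n"

print(find_combining_marks(sample))
print(unicodedata.normalize("NFC", sample) == sample)  # False: not in NFC
```

To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the result to the function. An empty result means the file contains no combining marks at all.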

Excursus: A Bit of Extra Background on HFS+

I first stumbled over this issue when trying to put the output of an ls command into my LaTeX document. The source of many, many problems in OS X is that the HFS+ file system uses (for some totally weird reasons) NFD. Even worse, HFS+ transparently transforms all NFC characters it gets as input into NFD internally. In practice, this means that the filenames you get out are different from those you put in: if you create a file ü (the keyboard delivers NFC) and then list the directory (the file system delivers NFD), the name looks the same but is in fact different. A short illustration (executed in an empty directory):

$ echo ü; echo ü | xxd; touch ü; ls; ls | xxd
ü
0000000: c3bc 0a                                  ...
ü
0000000: 75cc 880a                                u...

This is the reason so many tools (unison, svn, git, …) or bash's tab completion choke on OS X on filenames containing umlauts – and that you cannot use the output of ls directly in your LaTeX document.
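One way around the ls problem is to normalize the file names before they end up in the document. Here is a small Python sketch of that idea; the function names are invented for this example, and it assumes that `os.listdir` hands back the names in whatever form the file system reports.

```python
import os
import unicodedata

def nfc_names(names):
    """Return the given file names converted to NFC, sorted."""
    return sorted(unicodedata.normalize("NFC", name) for name in names)

def listing_for_latex(directory="."):
    """List a directory with NFC names, ready to paste into a .tex file."""
    return "\n".join(nfc_names(os.listdir(directory)))

# A name as HFS+ would report it (NFD) versus the name as typed (NFC).
from_filesystem = "u\u0308"
as_typed = "\u00fc"

print(from_filesystem == as_typed)                 # False
print(nfc_names([from_filesystem]) == [as_typed])  # True
```

Comparing names only after normalizing both sides is also a reasonable way to match user input against directory listings.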

Best Answer

(see possible solutions at the end.)

A survey of NFC and NFD UTF-8 forms in XeLaTeX input

xelatex handles NFD form almost out of the box. You will need to load the xltxtra package, which you probably always want to load when using XeLaTeX anyway.

Here's an example bash-script to create a test document (mkutest.sh):

#! /bin/bash
(
  TEXT="åäöüÅÄÖÜß"
  cat <<'EOF'
\documentclass{article}
\usepackage{xltxtra}
\begin{document}
EOF
  echo
  uconv -f utf-8 -t utf-8 -x nfc <<<"UTF-8-NFC: $TEXT"
  echo
  uconv -f utf-8 -t utf-8 -x nfd <<<"UTF-8-NFD: $TEXT"
  echo
  cat <<'EOF'
\end{document}
EOF
) > utest.tex

This script uses uconv (from ICU; see note 1 below) to create the two representations (NFC and NFD) of the same text and adds the XeLaTeX pre-/post-amble. The script should be "safe" to copy from the web page, since it uses the converter, and the text input to it can be in any UTF-8 form. (See note 2 below for a version that does not depend on uconv.)

The created file looks like this (utest.tex):

\documentclass{article}
\usepackage{xltxtra}
\begin{document}

UTF-8-NFC: åäöüÅÄÖÜß

UTF-8-NFD: åäöüÅÄÖÜß

\end{document}

(Note: This may not yield the desired file if just copied from the web. See the warning in the question.)

The result of running this through XeLaTeX is a PDF with the text:

enter image description here

where the two lines do not look exactly the same (even apart from the label). The accents in the first line look OK, but the accents of the capital letters in the second line are vastly misaligned.

So, although XeLaTeX can handle NFD form, it may not do it properly...

If \usepackage{xltxtra} is omitted the PDF looks like:

example without the xltxtra package

which corroborates the example use of XeLaTeX in the question. Furthermore, note that nothing at all shows up in the first row and that the ß is missing in the second row. This is because the loaded fonts don't have the glyphs needed to render them. The xltxtra package loads the fontspec package, which by default loads the font "Latin Modern". Without it, only legacy fonts are loaded, which do not play nicely with Unicode text at all.

I have tested with different fonts (system fonts loaded with the fontspec command \setmainfont{<name of font>}). The results have been somewhat diverse. For all fonts that have the needed glyphs, the first line looks correct. The second line, however, can come out in several different forms: for example, with the accents after the base letters, as if they were non-combining, or with missing-glyph boxes after the base letters...

As Khaled noted, XeTeX can normalize its input to NFC. With \XeTeXinputnormalization=1 added to the preamble, before any non-NFC text is read, and still using \usepackage{xltxtra} and/or other means to set up proper fonts, the output is:

example with automatic NFC-normalization

This time the two lines do look exactly the same (apart from the label).

What to do?

If using XeTeX, \XeTeXinputnormalization=1 is definitely a solution. Just remember that you have to properly set up the fonts.

The other way to go, which works with all(?) programs that support UTF-8 NFC text input, is to convert the input files beforehand.

To massage the files into NFC form one can, for example, use uconv (from ICU; see note 1 below) as I did in the MWE generator above.

$ uconv -o outfile.tex -f utf-8 -t utf-8 -x nfc infile.tex

(This works with UTF-16 encoding -- and others -- too. Just change the from (-f) and to (-t) options appropriately.)

Disclaimer: Use this command at your own risk. Be sure to keep the original file until you can verify the result.

This should probably be safe to run on any (7-bit) ASCII or UTF-8 encoded tex file. If the file is already in NFC, the conversion should not change anything, since it is idempotent. Files containing only 7-bit ASCII are already in NFC, since 7-bit ASCII is a subset of UTF-8 and contains no combining characters that could make the text non-NFC.
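If uconv is not at hand, the same conversion can be sketched in Python with the standard `unicodedata` module. This is not a replacement for the uconv command above, only an illustration of the same idea; the function name `normalize_file_to_nfc` and the temporary demo files are made up for this example.

```python
import tempfile
import unicodedata
from pathlib import Path

def normalize_file_to_nfc(infile, outfile):
    """Write an NFC copy of a UTF-8 file; return True if the text changed."""
    # Work on bytes so that line endings are passed through untouched.
    original = Path(infile).read_bytes().decode("utf-8")
    converted = unicodedata.normalize("NFC", original)
    Path(outfile).write_bytes(converted.encode("utf-8"))
    return converted != original

# Demo on a throwaway file that starts with an NFD u-umlaut.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "infile.tex"
    dst = Path(tmp) / "outfile.tex"
    src.write_bytes("u\u0308ber plain ASCII\n".encode("utf-8"))
    changed_first = normalize_file_to_nfc(src, dst)
    changed_again = normalize_file_to_nfc(dst, dst)  # second pass: no change
    result = dst.read_bytes()

print(changed_first, changed_again)  # True False
print(result)
```

The second pass reporting no change reflects the idempotence mentioned above, and the ASCII part of the line passes through unmodified.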

Notes

1. The uconv utility from ICU is in the package libicu-dev on my Ubuntu 12.04 64-bit.
   (I think it is among the examples for the ICU4C library, but I could not find any info about it from a quick search on the homepage. I'm a bit confused...)
2. As requested by David in his comment, I have made a version of the MWE generator that does not depend on uconv.
```
#!/bin/bash
(
  echo '\documentclass{article}'
  echo '\usepackage{xltxtra}'
  echo '\begin{document}'
  echo
  echo -e 'UTF-8-NFC: \xc3\xa5\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x85\xc3\x84\xc3\x96\xc3\x9c\xc3\x9f'
  echo
  echo -e 'UTF-8-NFD: \x61\xcc\x8a\x61\xcc\x88\x6f\xcc\x88\x75\xcc\x88\x41\xcc\x8a\x41\xcc\x88\x4f\xcc\x88\x55\xcc\x88\xc3\x9f'
  echo
  echo '\end{document}'
) > utest.tex
```
This version depends only on echo -e interpreting \xHH (and on echo without -e not doing so).

I kept the other version (above, in the main text) since it allows for easy changes in the sample text.

For the interested, the hex escapes for NFC are generated by uconv -x '[:Cc:]>; ::nfc;' <<<"$TEXT" | hexdump -v -e '/1 "%02x "' | sed -e 's/[[:xdigit:]][[:xdigit:]]/\\x\0/g; s/ //g', and similarly for NFD.
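The same escapes can also be produced without uconv, hexdump, and sed. The following Python sketch is an alternative, not the pipeline actually used for the script above; the function name `hex_escapes` is invented for this example, and the sample text is written with \u escapes so that the snippet itself survives copy & paste.

```python
import unicodedata

def hex_escapes(text, form):
    """Return the UTF-8 bytes of text, normalized to form, as \\xHH escapes."""
    data = unicodedata.normalize(form, text).encode("utf-8")
    return "".join(f"\\x{byte:02x}" for byte in data)

# The sample text from the script: a-ring, a/o/u-umlaut, their capitals, sharp s.
TEXT = "\u00e5\u00e4\u00f6\u00fc\u00c5\u00c4\u00d6\u00dc\u00df"

print(hex_escapes(TEXT, "NFC"))
print(hex_escapes(TEXT, "NFD"))
```

The two printed lines should match the strings passed to echo -e in the script above.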

Related Solutions

[Tex/LaTex] Using “◌” in Latex – “Unicode char \u8:◌ not set up for use with LaTeX”

As suggested in the comments, you could set your font to one that supports Unicode (this solution requires XeLaTeX):

\documentclass[12pt]{article}
\usepackage[a4paper]{geometry}
\usepackage{fontspec}

\setmainfont{Arial Unicode MS} % Other fonts that might work: Hiragino Maru Gothic Pro, Lao Sangam MN, Inaimathi, Microsoft Sans Serif

\begin{document}\huge

\noindent a b c \\
◌ \\
ä ë \\
à è \\
ç 

\end{document}

[Tex/LaTex] XeLaTeX, LuaLaTeX, fontspec, unicode and normalization

I don't know the answer to the first two questions, as I don't use XeTeX, but I want to offer an option for the third question.

Thanks to Arthur's code I was able to create a basic package for Unicode normalization in LuaLaTeX. The code needed only slight modifications to work with current LuaTeX. I will post only the main Lua file here; the full project is available on Github as uninormalize.

Sample usage:

\documentclass{article}
\usepackage{fontspec}
\usepackage[czech]{babel}
\setmainfont{Linux Libertine O}
\usepackage[nodes,buffer=false, debug]{uninormalize}
\begin{document}

Some tests:
\begin{itemize}
  \item combined letter ᾳ %GREEK SMALL LETTER ALPHA (U+03B1) + COMBINING GREEK YPOGEGRAMMENI (U+0345)
  \item normal letter ᾳ% GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI (U+1FB3)
\end{itemize}

Some more combined and normal letters: 
óóōōöö

Linux Libertine does support some combined chars: \parbox{4em}{příliš}
\end{document}

(Note that the correct version of this file is on Github; the combined letters were transferred incorrectly in this example.)

The main idea of the package is as follows: process the input, and whenever a letter followed by combining marks is found, replace it with its normalized NFC form. Two methods are provided. My first approach was to use node processing callbacks to replace decomposed glyphs with normalized characters. This would have the advantage that the processing could be switched on and off anywhere, using node attributes. Another possible feature would be to check whether the current font contains the normalized character and to keep the original form if it doesn't. Unfortunately, in my tests it fails with some characters: notably, a combined í appears in the nodes as dotless i + ´ instead of i + ´, which after normalization doesn't produce the correct character, so the combining sequence is used instead. But this produces output with a badly placed accent. So this method either needs some correction or is wrong altogether.
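The "letter followed by marks" idea can be sketched outside of LuaTeX as well. The Python snippet below is only an illustration of the principle, not a port of the package: it works on a plain string instead of a node list, and the name `compose_clusters` is made up. It uses the same letter and mark categories as the Lua code.

```python
import unicodedata

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Lm"}
MARK_CATEGORIES = {"Mn", "Mc", "Me"}

def compose_clusters(text):
    """Replace each letter + combining-mark run with its NFC form."""
    out = []
    i = 0
    while i < len(text):
        j = i + 1
        if unicodedata.category(text[i]) in LETTER_CATEGORIES:
            # Collect the marks that directly follow the base letter.
            while j < len(text) and unicodedata.category(text[j]) in MARK_CATEGORIES:
                j += 1
        cluster = text[i:j]
        # Only clusters that actually contain marks need normalizing.
        out.append(unicodedata.normalize("NFC", cluster) if j - i > 1 else cluster)
        i = j
    return "".join(out)

print(compose_clusters("u\u0308") == "\u00fc")       # True: u + diaeresis
print(compose_clusters("\u03b1\u0345") == "\u1fb3")  # True: alpha + ypogegrammeni
print(len(compose_clusters("\u0131\u0301")))         # 2: dotless i + acute stays as is
```

The last line is consistent with the problem described above: Unicode has no precomposed character for dotless i + acute, so NFC leaves that sequence decomposed.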

So the other method is to use the process_input_buffer callback to normalize the input file as it is read from disk. This method can't use information from the fonts, nor can it be turned off in the middle of a line, but it is significantly easier to implement. The callback function may look like this:

function buffer_callback(line) 
  return NFC(line)
end

which is a really nice finding after three days spent on the node-processing version.

For the curious, this is the Lua package:

local M = {}
dofile("unicode-names.lua")
dofile('unicode-normalization.lua')
local NFC = unicode.conformance.toNFC
local char = unicode.utf8.char
local gmatch = unicode.utf8.gmatch
local name = unicode.conformance.name
local byte = unicode.utf8.byte
local unidata = characters.data
local length = unicode.utf8.len

M.debug = false

-- for some reason variable number of arguments doesn't work
local function debug_msg(a,b,c,d,e,f,g,h,i)
  if M.debug then
    local t = {a,b,c,d,e,f,g,h,i}
    print("[uninormalize]", unpack(t))
  end
end

local function make_hash (t) 
  local y = {}
  for _,v in ipairs(t) do 
    y[v] = true
  end
  return y
end

local letter_categories = make_hash {"lu","ll","lt","lo","lm"}

local mark_categories = make_hash {"mn","mc","me"}

local function printchars(s)
    local t = {}
    for x in gmatch(s,".") do
        t[#t+1] = name(byte(x))
    end
    debug_msg("characters",table.concat(t,":"))
end

local categories = {}


local function get_category(charcode)
  local charcode = charcode or ""
  if categories[charcode] then
    return categories[charcode] 
  else
    local unidatacode = unidata[charcode] or {}
    local category = unidatacode.category
    categories[charcode] = category
    return category
  end
end

-- get glyph char and category
local function glyph_info(n)
  local char = n.char
  return char, get_category(char)
end

local function get_mark(n)
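  -- n is nil when the previous glyph was the last node in the list.
  -- 37 is the hard-coded glyph node id; node.id("glyph") looks it up for the running LuaTeX.
  if n and n.id == 37 then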
    local character, cat = glyph_info(n)
    if mark_categories[cat] then
      return char(character)
    end
  end
  return false
end

local function make_glyphs(head, nextn,s, lang, font, subtype) 
  local g = function(a) 
    local new_n = node.new(37, subtype)
    new_n.lang = lang
    new_n.font = font
    new_n.char = byte(a)
    return new_n
  end
  if length(s) == 1 then
    return node.insert_before(head, nextn,g(s))
  else
    -- s did not compose to a single character: insert one glyph per character
    for x in gmatch(s,".") do
      debug_msg("multi letter",x)
      -- keep only the new head; the original assigned to an accidental global
      head = node.insert_before(head, nextn, g(x))
    end
    return head
  end
end

local function normalize_marks(head, n)
  local lang, font, subtype = n.lang, n.font, n.subtype
  local text = {}
  text[#text+1] = char(n.char)
  local head, nextn = node.remove(head, n)
  --local nextn = n.next
  local info = get_mark(nextn)
  while(info) do
    text[#text+1] = info
    head, nextn = node.remove(head,nextn)
    info = get_mark(nextn)
  end
  local s = NFC(table.concat(text))
  debug_msg("We've got mark: " .. s)
  local new_n = node.new(37, subtype)
  new_n.lang = lang
  new_n.font = font
  new_n.char = byte(s)
  --head, new_n = node.insert_before(head, nextn, new_n)
  -- head, new_n = node.insert_before(head, nextn, make_glyphs(s, lang, font, subtype))
  head, new_n = make_glyphs(head, nextn, s, lang, font, subtype)
  local t = {}
  for x in node.traverse_id(37,head) do
    t[#t+1] = char(x.char)
  end
  debug_msg("Variables ", table.concat(t,":"), table.concat(text,";"), char(byte(s)),length(s))
  return head, nextn
end

local function normalize_glyphs(head, n)
  --local charcode = n.char
  --local category = get_category(charcode)
  local charcode, category = glyph_info(n)
  if letter_categories[category] then 
    local nextn = n.next
    -- nextn is nil when n is the last node in the list
    if nextn and nextn.id == 37 then
      --local nextchar = nextn.char
      --local nextcat = get_category(nextchar)
      local nextchar, nextcat = glyph_info(nextn)
      if mark_categories[nextcat] then
        return normalize_marks(head,n)
      end
    end
  end
  return head, n.next 
end


function M.nodes(head)
  local t = {}
  local text = false
  local n = head
  -- for n in node.traverse(head) do
  while n do
    if n.id == 37 then
      local charcode = n.char
      debug_msg("unicode name",name(charcode))
      debug_msg("character category",get_category(charcode))
      t[#t+1]= char(charcode)
      text = true
      head, n = normalize_glyphs(head, n)
    else
      if text then
        local s = table.concat(t)
        debug_msg("text chunk",s)
        --printchars(NFC(s))
        debug_msg("----------")
      end
      text = false
      t = {}
      n = n.next
    end
  end
  return head
end

--[[
-- These functions aren't needed when processing buffer. We can call NFC on the whole input line
local unibytes = {}

local function get_charcategory(s)
  local s = s or ""
  local b = unibytes[s] or byte(s) or 0
  unibytes[s] = b
  return get_category(b)
end

local function normalize_charmarks(t,i)
  local c = {t[i]}
  local i = i + 1
  local s = get_charcategory(t[i])
  while mark_categories[s] do
    c[#c+1] = t[i]
    i = i + 1
    s = get_charcategory(t[i])
  end
  return NFC(table.concat(c)), i
end

local function normalize_char(t,i)
  local ch = t[i]
  local c = get_charcategory(ch)
  if letter_categories[c] then
    local nextc = get_charcategory(t[i+1])
    if mark_categories[nextc] then
      return normalize_charmarks(t,i)
    end
  end
  return ch, i+1
end
-- ]]
function M.buffer(line)
  --[[
  local t = {}
  local new_t = {}
  -- we need to make a table with all uni chars on the line
  for x in gmatch(line,".") do
    t[#t+1] = x
  end
  local i = 1
  -- normalize next char
  local c, i = normalize_char(t, i)
  new_t[#new_t+1] = c
  while t[i] do
    c, i = normalize_char(t,i)
    -- local  c = t[i]
    -- i =  i + 1
    new_t[#new_t+1] = c
  end
  return table.concat(new_t)
  --]]
  return NFC(line)
end


return M

And now it is time for some pictures.

without normalization:

enter image description here

You can see that the combined Greek character is wrong; the other combinations are supported by Linux Libertine.

with node normalization:

enter image description here

The Greek letters are correct, but the í in the first příliš is wrong. This is the issue I was talking about.

and now the buffer normalization:

enter image description here

Everything is all right now.
