[Tex/LaTex] To input symbols with utf8 codes beyond the standard utf8 packet in Latex

accentsinput-encodingsunicode

I have input text for Latex with symbols like Ẃ ẃ Ẅ ẅ Ỳ ỳ and so on and must translate it into pdf. The task came from the whole work of export from Library of Congress catalog to prepare article in LaTex. So lot of symbols in utf8 codes but not in the form of Latex comands just fall into Latex input. When translation begins there are error message in the form: Package inputenc Error: Unicode char \u8:ĭ not set up for use with LaTeX. – with different utf8 symbols besides the standard list. All of them are like "Latin Accent" but of different form: Acute, Grave, Circumflex, Dieresis, Tilde, Caron, Macron, Breve, Overdot, Underdot. So to overcome this I use in preambula commands like \DeclareUnicodeCharacter{0174}{$\hat{W}$}. It works but such symbols are a lot of. Do somebody knows packet to do such work totally on all such symbols? The example with text producing a lot of error messages is here:

\documentclass[12pt]{article}

\usepackage[utf8]{inputenc}

\usepackage{cmap}       % good PDF (pdflatex)

\usepackage[T1,T2A]{fontenc}   

\usepackage[english,russian]{babel}

\DeclareUnicodeCharacter{02B9}{\textquoteright}   % modifier letter prime
% like apostrof 

% Welsh and Cornish languages
\DeclareUnicodeCharacter{0174}{$\hat{W}$}

\DeclareUnicodeCharacter{0175}{$\hat{w}$}

\DeclareUnicodeCharacter{0176}{$\hat{Y}$}

\DeclareUnicodeCharacter{0177}{$\hat{y}$}

\usepackage{amsmath}

\usepackage{amssymb}

\usepackage{wasysym}

\begin{document}

Acides nucléiques et synthèse des protéines chez les végétaux: {[}actes du colloque international{]}, Strasbourg, 15-24 juillet 1976

Russko-moldavskiĭ zoologicheskiĭ slovarʹ. Gani︠a︡

Russkiĭ narodnyĭ lubok

Slovarʹ-spravochnik po shelkovodstvu. Gani︠a︡

Aėrokosmichesk kurʹer =: Aerospace courier

\end{document}

Environment:
Windows 7
Miktex 2.9
Texworks
pdflatex

Question: Do somebody knows packet to do such work totally on all such symbols?

Best Answer

Use xelatex or lualatex and choose a font that has the symbols you need. Below I've used the Doulos SIL font.

\documentclass{article}
\usepackage{fontspec}
\setmainfont{Doulos SIL}  
\begin{document}
Acides nucléiques et synthèse des protéines chez les végétaux: {[}actes du colloque international{]}, Strasbourg, 15-24 juillet 1976

Russko-moldavskiĭ zoologicheskiĭ slovarʹ. Gani︠a︡

Russkiĭ narodnyĭ lubok

Slovarʹ-spravochnik po shelkovodstvu. Gani︠a︡

Aėrokosmichesk kurʹer =: Aerospace courier
\end{document}

enter image description here

Related Solutions

[Tex/LaTex] Is direct utf8 input of combining diacritics in math mode possible with lualatex

I got it working with a lua script. Your minimal example becomes:

\documentclass{minimal}
\usepackage{unicode-math}
\setmathfont{XITS Math}
\AtBeginDocument{\directlua{require("combining_preprocessor.lua")}}
\newcommand{\⃗}[1]{\ensuremath{\vec{#1}}}
\begin{document}
$v⃗$
\end{document}

The idea is that it's difficult to make LaTeX handle a command or macro that comes after its argument, which is how Unicode combining characters work, so we use would like a preprocessor to move the accent so it comes before its argument. That is, map v⃗ to \⃗{v} in a script, and then define whatever action you want \⃗ to have. (That's a backslash followed by a combining arrow, which should be printed above the backslash.)

My lua script does most (all?) of the combining characters, so you just need to define what they should do in the .tex file. Many accents on the same character is possible. Example:

\documentclass{minimal}

\usepackage{unicode-math}
\setmathfont{XITS Math}

\AtBeginDocument{\directlua{require("combining_preprocessor.lua")}}

\newcommand{\̂}[1]{\ensuremath{\hat{#1}}}
\newcommand{\⃑}[1]{\ensuremath{\vec{#1}}}
\newcommand{\̱}[1]{\ensuremath{\underline{#1}}}
\newcommand{\́}[1]{\ensuremath{\acute{#1}}}

\usepackage{stackrel}
\newcommand{\᷽}[1]{\ensuremath{\stackrel[\approx]{}{#1}}}

\begin{document}

Hello

$ℂ̂$ is hat on $ℂ$, more on $ℂ̂⃑$ (stress test)

$ℂ̂ x̂$

Many combining accents on $x᷽̱̂́⃑$ is cool.

\end{document}

(My browser doesn't do the many combining characters justice here, but it looks nice in the PDF file.)

Not sure if this is the ideal way of doing things, but for what it's worth, here is combining_preprocessor.lua:

function minornil(a, b)
   if a == nil and b == nil then
      return nil
   elseif a == nil then
      return b
   elseif b == nil then
      return a
   else
      return math.min(a, b)
   end
end

function findfirstcombining(line, n)
   local a = string.find(line, "\204[\128-\191]", n)     -- From U0300,
   local b = string.find(line, "\205[\128-\175]", n)     -- to U036F.
   a = minornil(a, b)
   b = string.find(line, "\226\131[\144-\176]", n) -- U20D0 to U20F0
   a = minornil(a, b)
   b = string.find(line, "\225\183[\128-\191]", n) -- U1DC0 to U1DFF
   a = minornil(a, b)
   return a
end

function is_utf8_continuation(byte)
   return byte < 191 and byte > 127
end

function find_next_utf8_char(str, n)
   while str:byte(n) ~= nil and is_utf8_continuation(str:byte(n)) do
      n = n + 1
   end
   return n
end

function combining_iter(str)
   local n = 0
   return function ()
      n = (n ~= nil) and findfirstcombining(str, n + 1)
      return n
   end
end

function dobuffer(line)
   local n1 = 0
   local t = {}
   for n2 in combining_iter(line) do
      if n2 > n1 then
         local n3 = n2
         repeat
            n3 = n3 - 1
         until not is_utf8_continuation(line:byte(n3))
         table.insert(t, string.sub(line, n1, n3 - 1))
         n1 = find_next_utf8_char(line, n2 + 1)
         local comb = {}
         table.insert(comb, "\\" .. string.sub(line, n2, n1 - 1) .. "{")
         table.insert(comb, string.sub(line, n3, n2 - 1) .. "}")
         n2 = findfirstcombining(line, n1)
         while n2 == n1 do
            n1 = find_next_utf8_char(line, n2 + 1)
            table.insert(comb, 1, "\\" .. line:sub(n2, n1 - 1) .. "{")
            table.insert(comb, "}")
            n2 = findfirstcombining(line, n1)
         end
         table.insert(t, table.concat(comb))
      end
   end
   table.insert(t, string.sub(line, n1))
   return table.concat(t)
end

luatexbase.add_to_callback("process_input_buffer",
                           dobuffer, "combining_preprocessor", 1)

[Tex/LaTex] OS X: umlauts in UTF8-NFD yield “Package inputenc Error: Unicode char \u8:̈ not set up for use with LaTeX”

(see possible solutions at the end.)

A survey of NFC and NFD UTF-8 forms in XeLaTeX input

xelatex almost handles NFD form almost out-of-the-box. You will need to load the xltxtra package, which you probably always want to load when using XeLaTeX, anyway.

Here's an example bash-script to create a test document (mkutest.sh):

#! /bin/bash
(
  TEXT="åäöüÅÄÖÜß"
  cat <<'EOF'
\documentclass{article}
\usepackage{xltxtra}
\begin{document}
EOF
  echo
  uconv -f utf-8 -t utf-8 -x nfc <<<"UTF-8-NFC: $TEXT"
  echo
  uconv -f utf-8 -t utf-8 -x nfd <<<"UTF-8-NFD: $TEXT"
  echo
  cat <<'EOF'
\end{document}
EOF
) > utest.tex

This script uses uconv (from ICU, See note 1 below) to create the two representations (NFC and NFD) of the same text and adds the XeLaTeX pre-/post-amble. This script should be "safe" to copy from the web page, since it uses the converter and the text input to it can be in any UTF-8 form. (See note 2 below for a version that does not depend on uconv.)

The created file looks like this (utest.tex):

\documentclass{article}
\usepackage{xltxtra}
\begin{document}

UTF-8-NFC: åäöüÅÄÖÜß

UTF-8-NFD: åäöüÅÄÖÜß

\end{document}

(Note: This may not yield the desired file if just copied from the web. See the warning in the question.)

The result of running this through XeLaTeX is a PDF with the text:

enter image description here

where the two lines does not look exactly the same (even apart from the label). The accents in the first line look OK, but the accents of the capital letters in the second line are vastly misaligned.

So, although XeLaTeX can handle NFD form, it may not do it properly...

If \usepackage{xltxtra} is omitted the PDF looks like:

example without the xltxtra package

which corroborates the example use of XeLaTeX in the question. Furthermore: Note that nothing at all shows up in the first row and the ß is missing on the second row. This is because the loaded fonts don't have the glyphs to render this. The xltxtra loads the package fontspec, which by default loads the font "Latin Modern". Without this only legacy fonts are loaded, which does not at all play nice with unicode text.

I have tested with different fonts (system fonts loaded with the fontspec command \setmainfont{<name of font>}). The result have been somewhat diverse. For all fonts that have the needed glyphs the first line looks correct. The second line, however, can come out in some different forms. For example with the accents after the base letters, as if they were non-combining; or with missing-glyph-boxes after the base letters...

As Khaled noted, XeTeX can normalize its input to NFC. Adding \XeTeXinputnormalization=1 to the preamble, before any non NFC-text is read, and still using \usepackage{xltxtra} and/or other means to set up proper fonts, the output is:

example with automatic NFC-normalization

This time the two lines does look exactly the same (apart from the label).

What to do?

If using XeTeX, \XeTeXinputnormalization=1 is definitely a solution. Just remember that you have to properly set up the fonts.

The other way to go, which works with all(?) programs that support UTF-8 NFC text input, is to convert the input files beforehand.

To massage the files into NFC form one can, for example, use uconv (from ICU^{See note 1 below}) as I did in the MWE-generator above.

$ uconv -o outfile.tex -f utf-8 -t utf-8 -x nfc infile.tex

(This works with UTF-16 encoding -- and others -- too. Just change the from (-f) and to (-t) options appropriately.)

Disclamer: Use this command at your own risk. Be sure to keep the original file until you can verify the result.

This should probably be safe to run on any (7-bit) ASCII or UTF-8 encoded tex file. If the file is already in NFC the conversion should not change anything, since it is idempotent. Files containing only 7-bit ascii are already in NFC, since 7-bit ASCII is a subset of UTF-8 and contains no combining characters that could make the text non-NFC.

Notes

The uconv utility from ICU is in the package libiuc-dev on my Ubuntu 12.04 64-bit.
(I think it is among the examples for the ICU4C library, but I could not find any info about the it from a quick search on the homepage. I'm a bit confused...)
As requested by David in his comment I have made a version of the MWE-generator that does not depend on uconv.
```
#!/bin/bash
(
  echo '\documentclass{article}'
  echo '\usepackage{xltxtra}'
  echo '\begin{document}'
  echo
  echo -e 'UTF-8-NFC: \xc3\xa5\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x85\xc3\x84\xc3\x96\xc3\x9c\xc3\x9f'
  echo
  echo -e 'UTF-8-NFD: \x61\xcc\x8a\x61\xcc\x88\x6f\xcc\x88\x75\xcc\x88\x41\xcc\x8a\x41\xcc\x88\x4f\xcc\x88\x55\xcc\x88\xc3\x9f'
  echo
  echo '\end{document}'
) > utest.tex
```
This version only depends on that echo -e interprets \xHH (and that echo without -e does not).

I kept the other version (above, in the main text) since it allows for easy changes in the sample text.

For the interested, the hex escapes are generated by uconv -x '[:Cc:]>; ::nfc;' <<<"$TEXT" | hexdump -v -e '/1 "%02x "' | sed -e 's/[[:xdigit:]][[:xdigit:]]/\\x\0/g; s/ //g' for NFC, &sim. for NFD.