[Tex/LaTex] Why does inputenc abandon so quickly under “utf8 based engines”

font-encodings input-encodings xetex

Why do I need the extra work starting with \ifdefined to get my French guillemets right in the PDF output, when compiling with xelatex a source that specifies T1-encoded fonts?

\documentclass[french]{article}

    \usepackage[T1]{fontenc}
    \usepackage[utf8]{inputenc}

\ifdefined\XeTeXinterchartoks
     \catcode`« \active
     \catcode`» \active
     \def«{\char19 }
     \def»{\char20 }% ça marche, même avec Babel+frenchb
\fi

\usepackage{newtxtext}

\usepackage{babel}
\frenchbsetup{og=«, fg=»}

\begin{document}

\showboxbreadth\maxdimen
\showboxdepth\maxdimen
\showoutput

«coucou»
\end{document}

The log contains:

Package: inputenc 2015/03/17 v1.2c Input encoding file
\inpenc@prehook=\toks14
\inpenc@posthook=\toks15


Package inputenc Warning: inputenc package ignored with utf8 based engines.

But inputenc is loaded after fontenc, and using fontenc with xelatex is not forbidden, so inputenc should know that T1-encoded font slots are to be used. Why then doesn't it make these characters active and map them to the suitable \char slots?

I am missing something here…

Notice that the code sample also uses babel+frenchb, which adds automatic spacing. It does not seem to have been disturbed by my making the characters active.

To explain the issue further, consider the following input:

\documentclass{article}

    \usepackage[T1]{fontenc}
    \usepackage[utf8]{inputenc}

\begin{document}

\showboxbreadth\maxdimen
\showboxdepth\maxdimen
\showoutput

«coucou»

\end{document}

If compiled with xelatex, it produces the wrong output.

The explanation is simple: the characters « and » are not ASCII; their code points are 171 and 187 respectively. Since inputenc does nothing, the glyphs sitting in slots 171 and 187 of the T1-encoded font are used, giving the result. inputenc could have done something akin to my code above. The \showoutput trace shows the characters reaching the font unchanged:

...\hbox(6.63332+0.0)x345.0, glue set 290.00977fil
....\hbox(0.0+0.0)x15.0
....\T1/cmr/m/n/10 «
....\T1/cmr/m/n/10 c
....\T1/cmr/m/n/10 o
....\T1/cmr/m/n/10 u
....\T1/cmr/m/n/10 c
....\T1/cmr/m/n/10 o
....\T1/cmr/m/n/10 u
....\T1/cmr/m/n/10 »
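
The slot mismatch can be checked directly. The following is only a minimal sketch: it relies on the slot numbers already used in the question (19 and 20 for the guillemets in T1) and prints, for comparison, whatever glyphs the T1 font holds in slots 171 and 187.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}

\begin{document}

% Slots 19 and 20 are where T1 stores the guillemets
% (the same numbers as in the \char19 / \char20 workaround).
\char19 coucou\char20

% Slots 171 and 187 are what an unprocessed « and » select
% under xelatex; T1 holds other glyphs there.
\char171 coucou\char187

\end{document}
```

The first line should show proper guillemets, the second the same wrong glyphs as the test document.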

Best Answer

inputenc gives up because it does absolutely nothing useful with XeLaTeX or LuaLaTeX. Or rather, it would do harm!

See fontenc vs inputenc

Essentially, the task performed by inputenc is translating input characters into their LICR (LaTeX internal character representation) form. With an 8-bit engine, « is two bytes long in UTF-8, and inputenc is able to translate it into \guillemotleft and » into \guillemotright. But to do so it must make some characters active. That is exactly what you do later on, and what inputenc is not instructed to do, because it is designed for an 8-bit engine.
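
As a rough illustration of that idea, here is a hand-made sketch for xelatex that makes the two characters active and maps them to their LICR names instead of hard-coded \char slots. It is not what inputenc itself does internally, only the same principle written out by hand; the use of \protected is my own choice to keep the definitions from expanding prematurely.

```latex
% Compile with xelatex
\documentclass{article}
\usepackage[T1]{fontenc}

% Make the two characters active ...
\catcode`\«=\active
\catcode`\»=\active
% ... and map them to their LICR names, so that the current
% font encoding decides which slot is actually used.
\protected\def«{\guillemotleft}
\protected\def»{\guillemotright}

\begin{document}

«coucou»

\end{document}
```

Because the definitions go through \guillemotleft and \guillemotright, they do not depend on the T1 slot numbers.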

Here is a friendlier interface, using newunicodechar:

\documentclass[french]{article}

\usepackage[T1]{fontenc}
\usepackage{newunicodechar}

\newunicodechar{«}{\guillemotleft}
\newunicodechar{»}{\guillemotright}

\usepackage{newtxtext}

\usepackage{babel}
\frenchbsetup{og=«, fg=»}

\begin{document}

«coucou»

\end{document}

If your aim is to provide translations for all the characters listed in t1enc.dfu, then you can use newunicodechar in a different way:

\documentclass[french]{article}

\usepackage[T1]{fontenc}
\usepackage{newunicodechar}

\newcommand\DeclareUnicodeCharacter[2]{%
  \expandafter\newunicodechar\Uchar"#1{#2}%
}
\input{t1enc.dfu}

\usepackage{newtxtext}

\usepackage{babel}
\frenchbsetup{og=«, fg=»}

\begin{document}

«coucou»

\end{document}
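
To see what the redefinition does, it may help to apply it to a single entry by hand. The sketch below assumes, as the code above does, that \DeclareUnicodeCharacter is not already defined by the format and that the engine provides \Uchar; the entry shown is the one t1enc.dfu contains for U+00AB.

```latex
% Compile with xelatex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{newunicodechar}

\newcommand\DeclareUnicodeCharacter[2]{%
  % \Uchar"#1 is expanded first and yields the character itself,
  % so \newunicodechar receives « as its first argument.
  \expandafter\newunicodechar\Uchar"#1{#2}%
}

% One entry, written as it appears in t1enc.dfu:
\DeclareUnicodeCharacter{00AB}{\guillemotleft}
% This is equivalent to: \newunicodechar{«}{\guillemotleft}

\begin{document}

«coucou

\end{document}
```

Inputting the whole of t1enc.dfu simply repeats this step for every character the file lists.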

A proof of concept for a package `xeinputenc`

\ProvidesPackage{xeinputenc}[2015/12/12]
\RequirePackage{newunicodechar}

\AtBeginDocument{\xeinputenc@process}

\newcommand{\xeinputenc@process}{%
  \begingroup
  \gdef\xeinputenc@list{}%
  \def\cdp@elt##1##2##3##4{%
    \g@addto@macro\xeinputenc@list{\lowercase{\xeinputenc@input{##1}}}%
  }%
  \cdp@list
  \aftergroup\xeinputenc@list
  \endgroup
}

\newcommand{\DeclareUnicodeCharacter}[2]{%
  \expandafter\newunicodechar\Uchar"#1{#2}%
}

\newcommand{\xeinputenc@input}[1]{%
  \InputIfFileExists
    {#1enc.dfu}
    {\wlog{... processing UTF-8 mapping file for font encoding #1}\catcode`\ 9\relax}%
    {\wlog{... no UTF-8 mapping file for font encoding #1}}%
}


\@onlypreamble\DeclareUnicodeCharacter
\@onlypreamble\xeinputenc@list
\@onlypreamble\xeinputenc@process
\@onlypreamble\xeinputenc@input
\endinput

Now your test document can be

\documentclass[french]{article}

\usepackage{xeinputenc}

\usepackage{newtxtext}

\usepackage{babel}
\frenchbsetup{og=«, fg=»}

\begin{document}

«coucou»

\end{document}

No explicit loading of fontenc is needed in this case, because this is already taken care of by newtxtext, but calls to it will be honored.
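
For instance, a variant of the test document without newtxtext could look like the sketch below. It assumes the proof-of-concept code above has been saved as xeinputenc.sty where TeX can find it, and that \DeclareUnicodeCharacter is not already defined by the format.

```latex
% Compile with xelatex
\documentclass[french]{article}

\usepackage[T1]{fontenc} % explicit call; its encoding ends up in \cdp@list
\usepackage{xeinputenc}  % hypothetical file name for the package above

\usepackage{babel}
\frenchbsetup{og=«, fg=»}

\begin{document}

«coucou»

\end{document}
```

The package only looks at the declared encodings at \begin{document}, so the relative order of fontenc and xeinputenc in the preamble should not matter.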

Related Solutions

[Tex/LaTex] utf8x vs. utf8 (inputenc)

The simple answer is that utf8x is to be avoided if possible. It loads the ucs package, which for a long time was unmaintained (although there is now a new maintainer) and breaks various other things.

See egreg's answer to this question as well, which outlines how to get extra characters using the [utf8] option of inputenc.
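
As a minimal sketch of that approach (not necessarily the exact code from the linked answer), an extra character can be declared with \DeclareUnicodeCharacter when compiling with pdflatex. The choice of U+2212 and of its replacement text is only an illustration.

```latex
% Compile with pdflatex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

% Tell inputenc what to do with U+2212 MINUS SIGN
\DeclareUnicodeCharacter{2212}{\ensuremath{-}}

\begin{document}

3 − 2 = 1

\end{document}
```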

Generally, however, the best way to deal with Unicode source (especially with non-Latin scripts) is to use XeLaTeX or LuaLaTeX.

There's an extended discussion of this here: Encoding remarks. See especially the comments by Philipp Lehman and Philipp Stephani.

[Tex/LaTex] LaTeX does not print words correctly: inputenc/fontenc problem

Don't use the ﬁ and ﬂ characters in the input, but write firms and fleets.

Also add the following "magic" line at the beginning of your file

% !TEX encoding = UTF-8 Unicode

This will ensure that TeXShop interprets your file as UTF-8.

If your text has already many instances of ﬁ and ﬂ, you can consider adding the following to your preamble:

\usepackage{newunicodechar}
\newunicodechar{ﬁ}{fi}
\newunicodechar{ﬂ}{fl}

but it's best to stick with normal input.

Accented characters will be treated correctly.

Here's an example:

% !TEX encoding = UTF-8 Unicode
\documentclass[a4paper,12pt]{article}
\linespread{1.5}
\usepackage[francais,english]{babel} 
\usepackage[utf8]{inputenc} 
\usepackage[T1]{fontenc} 
\usepackage[round]{natbib}
\usepackage{epigraph}
\usepackage{makeidx}
\usepackage{url}
\usepackage{color}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref}
\usepackage[nottoc]{tocbibind}
\setcounter{tocdepth}{12}
\usepackage{eurosym}

\usepackage{newunicodechar}
\newunicodechar{ﬁ}{fi}
\newunicodechar{ﬂ}{fl}

\usepackage{ragged2e}

\begin{document}

 These earlier firms, were far more powerful; they commanded armies and fleets

 These earlier ﬁrms, were far more powerful; they commanded armies and ﬂeets

 Garçon, été, l'Hôpital, Génève

\bibliographystyle{plainnat}
\bibliography{biblio.bib}
\printindex
\end{document}
