A friend of mine uses pdfLaTeX for her university notes, just like me. She uses the inputenc package, since on her Windows machine that allows characters with diacritics such as è or ò to be typeset correctly without having to type \`e or \`o instead. I don't use inputenc, since on my Mac it generates "Keyboard character not set up for use with LaTeX" or something like that. So the first question is: why this difference? The second question is: why, when she sends me a tex document, do I see √® instead of "è" and √≤ instead of "ò"? And finally, pdfTeX typesets that document with the original characters with diacritics: why? Shouldn't it typeset what is typed, as happens with XeTeX?
[Tex/LaTex] Strange character changes (Xe(La)TeX or pdf(La)TeX, it’s irrelevant)
accents unicode
Best Answer
It is a problem of encodings. From the samples you provided (your friend writes è, you see √®), it can be inferred that your friend uses an editor which saves the text in UTF-8 encoding, while you are using an editor which uses Mac-Roman encoding instead. Let me elaborate on this a bit.
Editors and encodings
Each character you (or your friend) write in the editor has to be coded as a sequence of bytes to be stored in a file. The particular sequence of bytes written to the file is editor-dependent.
All editors agree on a set of characters called ASCII, for which the mapping between characters and bytes is standard (you can find a table with all the characters in this set and their codes here, for example). TeX also understands this standard, so as long as you write files which use exclusively characters from this set, all editors will agree on what is shown to the user, and (La)TeX will process it without problems.
Unfortunately, the ASCII standard does not include codes for accented letters and other non-English characters. TeX includes some shorthands to introduce this kind of symbol via sequences of ASCII chars, such as \`e.
However, most operating systems and editors allow you to type the letter directly (provided that your keyboard has dead-accent keys) and show you è, which is more convenient. The problem arises when this character has to be stored on disk. Since there is no ASCII code for the character è, another encoding has to be used. An encoding is a mapping between characters and binary codes, just like ASCII. There are lots of alternative encodings, "utf-8" and "mac-roman" among them. Each one uses a different bit sequence for the è, which explains why your friend and you don't see the same letters in your editors.
In particular, UTF-8 uses two bytes to represent the è. The values of those bytes are (in hexadecimal) c3 and a8. Your editor uses Mac-Roman instead, and this encoding uses a single byte for each character, so those two bytes are shown as two chars. In particular, you can see here that those bytes are interpreted as follows: c3 = √, a8 = ®, which of course explains why you see √®.
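The byte values above can be checked outside of any editor. The following is a minimal sketch in Python (not part of the original answer), assuming Python 3 and its built-in "mac_roman" codec; the variable names are made up for illustration.

```python
# What the friend's editor writes to disk for "è" when it saves as UTF-8.
utf8_bytes = "è".encode("utf-8")
print(utf8_bytes.hex())  # c3a8

# What an editor displays when it reads those same two bytes as Mac-Roman:
# each byte is taken as one character.
seen_on_mac = utf8_bytes.decode("mac_roman")
print(seen_on_mac)  # √®
```

Nothing in the file changes between the two steps; only the table used to interpret the bytes does.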
inputenc
By default TeX understands only the ASCII character set, and in ASCII all codes are below the value 80 (hex), because ASCII uses only 7 bits. So TeX decides to ignore all bytes read from the source whose code is greater than or equal to 80. In particular, it will ignore c3 and a8, and then the è will not be shown in the output.
You need to explicitly ask TeX not to ignore those chars and, moreover, you need to tell TeX which encoding your editor was using. In other words, when TeX finds the sequence c3, a8, should it produce the character è, or should it produce the characters √® instead? TeX has no way to decide unless you tell it that your editor used UTF-8 (and thus TeX will produce è) or Mac-Roman (and thus TeX will produce √®).
This is what the package inputenc is for. By loading this package you tell TeX the encoding used by the editor which saved the file, so that TeX can render its contents properly.
So?
So you are now equipped to understand that you have two different problems, and their solutions:
1. You have to tell TeX that the encoding of the file is "utf8". You do that with \usepackage[utf8]{inputenc}. Presumably, this line was already in the file generated by your friend. Otherwise she would also get a bad PDF when compiling the document.
2. You want to see è instead of √® in your editor. For this problem you have two solutions:
   - Tell your editor that it should use utf-8 instead of Mac-Roman. I can't tell you how to do that, since I don't know which editor you are using, but any decent editor should allow you to specify whatever encoding you want. This is the preferred solution, since then your friend and you will be using the same encoding, and this will also be the encoding specified to inputenc.
   - Recode the files coming from your friend to your Mac-Roman encoding. You can use tools such as iconv or recode to do so. But this is a bad solution, because then you should also change inputenc to specify the new encoding (otherwise TeX will render wrong characters).
XeTeX
XeTeX and LuaTeX use utf-8 natively. This means that they expect the source file to use utf-8 to encode any non-ASCII char, so inputenc is not necessary.
But this also means that, by default, XeTeX and LuaTeX do not accept any other encoding. In particular they don't accept your Mac-Roman encoding. So to use XeTeX you have to configure your editor to save in utf-8 format, which leads us again to the same "best solution" (in this case the only solution).
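If you already have files saved as Mac-Roman, the recoding idea mentioned earlier (what iconv or recode do) can also be used in the useful direction, towards UTF-8. A rough sketch in Python follows; the function name recode is hypothetical and only mirrors the idea, not the interface of those tools.

```python
def recode(data: bytes, src: str, dst: str) -> bytes:
    """Reinterpret bytes written in encoding `src` and rewrite them in `dst`.

    Hypothetical helper for illustration: decode with the encoding the file
    was really saved in, then encode with the one you want on disk.
    """
    return data.decode(src).encode(dst)


# "è" as a Mac-Roman editor stores it (one byte) ...
mac_bytes = b"\x8f"
# ... becomes the two-byte UTF-8 sequence that XeTeX or inputenc's utf8 expects.
print(recode(mac_bytes, "mac_roman", "utf-8").hex())  # c3a8
```

The conversion only works if src names the encoding the file was actually saved in; guessing wrong reproduces exactly the kind of garbling described above.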
Your questions
1. Presumably you used \usepackage[utf8]{inputenc} on your Mac, copying it from one of your friend's files. But then you wrote è and saved the file in Mac-Roman, which is not the encoding declared in inputenc. So TeX complains, or gets the characters wrong.
   In particular, when you write è your editor saves one byte of value 8f (as can be seen here). But that value is not a valid utf-8 sequence, hence the TeX error.
   Fun fact: if by chance you had saved a valid utf-8 sequence (for example, if you typed the formula $√∂$), TeX would not complain, but you would get an unexpected result (ö). This particular input (√∂) is coded in Mac-Roman by the sequence c3, b6, which happens to be valid utf8 and represents ö.
2. This was already answered, I hope.
3. Because the file contains \usepackage[utf8]{inputenc}, and indeed it does contain utf8 codes. Although your editor shows these characters wrong, because it is unaware that the file is utf8, TeX gets them right thanks to the inputenc declaration.
   Neither pdfTeX, nor XeTeX, nor the editor knows "what was typed". All they know is "what was saved", and to interpret that they need to know the encoding in which it was saved. XeTeX blindly assumes it was utf-8. pdfTeX gets the information from the inputenc declaration. The editor needs the user to choose the proper settings.
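The two cases from the first answer (the invalid byte 8f and the accidental ö) can be reproduced with a short Python sketch. It assumes Python 3 and its "mac_roman" codec; valid_utf8 is a made-up helper name, and it only approximates the kind of check a utf8 declaration implies, not what inputenc does internally.

```python
def valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "è" saved by a Mac-Roman editor is the single byte 8f, which is not valid
# UTF-8 on its own, so a file declared as utf8 cannot be read correctly.
mac_bytes = "è".encode("mac_roman")
print(mac_bytes.hex(), valid_utf8(mac_bytes))  # 8f False

# "√∂" saved as Mac-Roman gives c3 b6, which happens to be valid UTF-8
# and decodes to "ö": no error, but not what was typed.
coincidence = "√∂".encode("mac_roman")
print(coincidence.hex(), coincidence.decode("utf-8"))  # c3b6 ö
```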