[Tex/LaTex] Strange character changes (Xe(La)TeX or pdf(La)TeX, it’s irrelevant)

accents, unicode

A friend of mine uses pdfLaTeX for her university notes just like me. She uses the inputenc package, since on her Windows machine that allows characters with diacritics such as è or ò to be typeset correctly without having to type \`e or \`o instead. I don't use inputenc since on my Mac that generates "Keyboard character not set up for use with LaTeX" or something like that. So the first question is: why this difference? And the second question is: why does it happen that if she sends me a tex document I see √® instead of "è" and √≤ instead of "ò"? And finally, with pdfTeX that typesets to the original characters with diacritics: why? Shouldn't it typeset to what is typed as happens with XeTeX?

Best Answer

It is a problem of encodings. From the samples you provided (your friend writes è, you see √®), it can be inferred that your friend uses an editor that saves the text in the UTF-8 encoding, while you are using an editor that uses the Mac-Roman encoding instead.

Let me elaborate this a bit.

Editors and encodings

Each character you (or your friend) write in the editor has to be coded as a sequence of bytes to be stored in a file. The particular sequence of bytes written to the file is editor-dependent.

All editors agree on a set of characters called ASCII, for which the mapping between characters and bytes is standardized (tables listing every character in this set and its code are easy to find). TeX also understands this standard, so as long as you write files that use exclusively characters from this set, all editors will agree on what is shown to the user, and (La)TeX will process it without problems.

Unfortunately, the ASCII standard does not include codes for accented letters and other non-English characters. TeX includes some shorthands to enter these kinds of symbols via sequences of ASCII characters, such as \`e.

However, most operating systems and editors allow you to directly type the letter (provided that your keyboard has dead-accent keys), and show you è, which is more convenient. The problem is when this character has to be stored on disk. Since there is no ASCII code for the character è, another encoding has to be used. An encoding is a mapping between characters and binary codes, just like ASCII. There are lots of alternative encodings, "utf-8" and "mac-roman" among them. Each one uses a different bit sequence for the è, which explains why your friend and you don't see the same letters in your editors.
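To make this concrete, here is a small Python sketch (Python is used purely for illustration and is not part of the TeX workflow; `utf-8`, `mac_roman` and `latin-1` are Python's codec names for these encodings, and the helper `stored_bytes` is a made-up name) that prints the bytes each encoding stores for è:

```python
# Bytes stored on disk for the same character under different encodings.
CHAR = "\u00e8"  # the letter è


def stored_bytes(char, encoding):
    """Return the hex codes of the bytes that `encoding` uses for `char`."""
    return char.encode(encoding).hex(" ")


for name in ("utf-8", "mac_roman", "latin-1"):
    print(f"{name:10} -> {stored_bytes(CHAR, name)}")

# ASCII has no code for è at all, so encoding it fails.
try:
    CHAR.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("ASCII can encode è:", ascii_ok)
```

The same letter ends up as two bytes in one encoding and as a single, different byte in each of the others.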

In particular, UTF-8 uses two bytes to represent è. The values of those bytes are (in hexadecimal) c3 and a8. Your editor uses Mac-Roman instead, and this encoding uses a single byte for each character, so those two bytes are shown as two characters. In a Mac-Roman table those bytes are interpreted as follows: c3=√, a8=®, which of course explains why you see √®.
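The effect can be reproduced with a short Python sketch. The function name `seen_in_editor` is hypothetical; it simply encodes text the way the sender's editor would and decodes it the way the receiver's editor would:

```python
def seen_in_editor(text, saved_as, opened_as):
    """Encode `text` as the sender's editor does, then decode it as the receiver's does."""
    return text.encode(saved_as).decode(opened_as)


# è and ò, saved as UTF-8 on Windows and opened as Mac-Roman on the Mac.
for original in ("\u00e8", "\u00f2"):
    shown = seen_in_editor(original, "utf-8", "mac_roman")
    print(original, "->", shown)
```

Both letters from the question come out as the two-character sequences reported there.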

inputenc

By default TeX understands only the ASCII character set, and in ASCII all codes are below the value 80 (hex), because ASCII uses only 7 bits. So TeX decides to ignore all bytes read from the source whose code is greater than or equal to 80. In particular, it will ignore c3 and a8, and then the è will not be shown in the output.
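If you want to check whether a source file contains any such bytes, a minimal Python sketch along these lines can list them. The helper `non_ascii_bytes` is an illustrative name, and the sample string stands in for the contents of a real file:

```python
def non_ascii_bytes(data):
    """Return (offset, hex code) for every byte with value 0x80 or above."""
    return [(i, f"{b:02x}") for i, b in enumerate(data) if b >= 0x80]


# "caffè" as a UTF-8 editor would save it; a real file would be read with open(path, "rb").
source = "caff\u00e8".encode("utf-8")
print(non_ascii_bytes(source))
```

An empty list means the file is pure ASCII and no encoding declaration is needed for it.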

You need to explicitly ask TeX not to ignore those characters, and moreover, you need to tell TeX which encoding your editor was using. In other words, when TeX finds the sequence c3, a8, should it produce the character è or should it produce the characters √® instead? TeX has no way to decide, unless you tell it that your editor used UTF-8 (and thus TeX will produce è) or Mac-Roman (and thus TeX will produce √®).

This is what the inputenc package is for. By loading this package you tell TeX the encoding used by the editor that saved the file, so that TeX can render its contents properly.

So?

So you are now equipped to understand that you have two different problems, and their solutions:

  1. You have to tell TeX that the encoding of the file is "utf8". You do that with \usepackage[utf8]{inputenc}. Presumably, this line was already in the file generated by your friend. Otherwise she would also get a bad PDF when compiling the document.

  2. You want to see è instead of √® in your editor. For this problem you have two solutions:

    • Tell your editor that it should use utf-8 instead of Mac-Roman. I can't tell you how to do that, since I don't know which editor you are using, but any decent editor should allow you to specify whatever encoding you want. This is the preferred solution, since then your friend and you will be using the same encoding, and this will also be the encoding specified to inputenc.

    • Recode the files coming from your friend to your Mac-Roman encoding. You can use tools such as iconv or recode to do so. But this is a bad solution, because then you must also change the inputenc option to the new encoding (applemac for Mac-Roman); otherwise TeX will render wrong characters.
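For completeness, the recoding step can also be sketched in a few lines of Python instead of iconv or recode. The function names `recode` and `recode_file` are hypothetical, and the defaults assume the UTF-8 to Mac-Roman direction discussed here:

```python
def recode(data, source_encoding="utf-8", target_encoding="mac_roman"):
    """Re-encode raw file contents from one encoding to another."""
    return data.decode(source_encoding).encode(target_encoding)


def recode_file(src_path, dst_path, source_encoding="utf-8", target_encoding="mac_roman"):
    """Read src_path as bytes, recode it, and write the result to dst_path."""
    with open(src_path, "rb") as src:
        converted = recode(src.read(), source_encoding, target_encoding)
    with open(dst_path, "wb") as dst:
        dst.write(converted)


# "caffè" saved as UTF-8 becomes a single byte 8f for the è in Mac-Roman.
print(recode(b"caff\xc3\xa8"))
```

Note that the conversion raises a UnicodeEncodeError if the text contains a character that the target encoding cannot represent, which is one more reason to prefer switching the editor to utf-8.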

XeTeX

XeTeX and LuaTeX use utf-8 natively. This means that they expect the source file to use utf-8 to encode any non-ASCII character. So inputenc is not necessary.

But this also means that XeTeX and LuaTeX do not allow any other encoding. In particular they don't allow your Mac-Roman encoding. So to use XeTeX you have to configure your editor to save in utf-8 format. Which leads us again to the same "best solution" (in this case the only solution).

Your questions

I don't use inputenc since on my Mac that generates "Keyboard character not set up for use with LaTeX" or something like that. So the first question is: why this difference?

Presumably you used \usepackage[utf8]{inputenc} on your Mac, copying it from one of your friend's files. But then you wrote è and saved the file in Mac-Roman, which is not the encoding declared to inputenc. So TeX complains, or gets the characters wrong.

In particular, when you write è your editor saves one byte of value 8f (as any Mac-Roman table shows). But that value is not a valid utf-8 sequence, hence the TeX error.
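A quick way to confirm this is to ask a UTF-8 decoder what it thinks of that byte. In the Python sketch below, `is_valid_utf8` is an illustrative helper, not something TeX or inputenc provides:

```python
def is_valid_utf8(data):
    """Return True if `data` can be decoded as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


saved = "\u00e8".encode("mac_roman")  # what a Mac-Roman editor writes for è
print(saved.hex())                     # 8f
print(is_valid_utf8(saved))            # False: a lone 8f is not valid UTF-8
```

The byte 8f can only appear in UTF-8 as a continuation of a multi-byte sequence, never on its own, which is why a UTF-8 reader rejects it.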

Fun fact: if by coincidence you had saved a valid utf-8 sequence (for example, if you had typed the formula $√∂$), TeX would not complain, but you would get an unexpected result (ö). This particular input (√∂) is coded in Mac-Roman as the sequence c3,b6, which happens to be valid utf-8 and represents ö.
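This coincidence is easy to verify with a short Python sketch (the variable names are arbitrary):

```python
typed = "\u221a\u2202"               # √∂ as typed in a Mac-Roman editor
on_disk = typed.encode("mac_roman")  # the bytes the editor actually saves
as_utf8 = on_disk.decode("utf-8")    # how a UTF-8 reader interprets them

print(on_disk.hex(" "), "->", as_utf8)
```

The decoding succeeds without any error, so nothing warns you that the result is not what was typed.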

And the second question is: why does it happen that if she sends me a tex document I see √® instead of "è" and √≤ instead of "ò"?

This was already answered, I hope.

And finally, with pdfTeX that typesets to the original characters with diacritics: why?

Because the file contains \usepackage[utf8]{inputenc}, and it does indeed contain utf8 codes. Although your editor shows these characters wrongly, because it does not know that the file is utf8, TeX gets them right thanks to the inputenc declaration.

Shouldn't it typeset to what is typed as happens with XeTeX?

Neither pdfTeX, nor XeTeX, nor the editor knows "what was typed". All they know is "what was saved", and to interpret that they need to know the encoding in which it was saved. XeTeX blindly assumes it was utf-8. pdfTeX gets the information from the inputenc declaration. The editor needs the user to choose the proper setting.
