[Tex/LaTex] How does (La)TeX read UTF-8?

unicode

As described in The TeXbook, TeX reads files byte by byte, regardless of the particular format; as I understand it, this is just how IniTeX is set up.

I also understand that LaTeX is just a collection of macros built on top of IniTeX, described in most distributions of TeX by the file latex.ltx.

The above two things are at odds with my understanding of LaTeX's ability to read UTF-8. I was under the impression that reading the input byte by byte (and thus for instance, only being able to access numbers from 0 to 255 using \char or something) was baked into TeX, and thus would exist in all variants built on top of it.

Thus, how is LaTeX able to do this?

Best Answer

If you want to know how the 8-bit engines handle utf8 input you can use \tracingmacros:

\documentclass{article}

\begin{document}
{\tracingmacros =1 ä }
\end{document}

which gives

Ã->\UTFviii@two@octets Ã

\UTFviii@two@octets #1#2->\expandafter \UTFviii@defined \csname u8:#1\string #2
\endcsname 
#1<-Ã
#2<-¤

\UTFviii@defined #1->\ifx #1\relax \if \relax \expandafter \UTFviii@checkseq \s
tring #1\relax \relax \UTFviii@undefined@err {#1}\else \PackageError {inputenc}
{Invalid UTF-8 byte sequence}\UTFviii@invalid@help \fi \else \expandafter #1\fi

#1<-\u8:ä 

\u8:ä ->\IeC {\"a}

That means the first byte of the ä (the Ã) is an active character: a command that picks up the next byte and then calls \u8:ä, which in turn calls \"a. In this way (pdf)latex can handle quite a lot of UTF-8 input, but it has problems with, e.g., "char + combining accent", as there is no sensible way for the code of the combining accent to go back and add an accent to the preceding char.
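
As a rough illustration of that mechanism, here is a hand-rolled sketch for an 8-bit engine (pdflatex) only. It is not the kernel's actual code: the demo: prefix is a made-up name standing in for the real u8: table, and only the single byte pair for ä is handled. ^^c3 and ^^a4 are TeX's notation for the bytes 0xC3 and 0xA4.

\documentclass{article}

\begin{document}
\begingroup
  % Make byte 0xC3 (the first byte of ä) an active character.
  \catcode"C3=\active
  % The active byte grabs the next byte as #1 and looks up a macro
  % whose name is built from it (hypothetical "demo:" prefix).
  \def^^c3#1{\csname demo:\string#1\endcsname}
  % Map the second byte 0xA4 to the accent command for ä.
  \expandafter\def\csname demo:\string^^a4\endcsname{\"a}
  ä
\endgroup
\end{document}

The redefinition is kept inside a group because, while it is in effect, every other character whose first byte is 0xC3 (ö, ü, ...) looks up an undefined demo: name and typesets nothing. The real implementation avoids this by defining one u8: entry per supported character and raising an error for the rest, as the trace shows.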
