[TeX/LaTeX] Producing PDF from Markdown with pandoc and xelatex generates misleading error messages

pandoc, unicode, xetex

I ran into a strange issue when converting a Markdown file to PDF with pandoc. The file contains both Chinese and English characters. The command I use is:

pandoc --pdf-engine=xelatex -V CJKmainfont=KaiTi test.md -o test.pdf

The error message is:

Error producing PDF.
! Undefined control sequence.
pandoc: Cannot decode byte '\xbd': Data.Text.Internal.Encoding.streamDecodeUtf8With: Invalid UTF-8 stream

In fact, the error has nothing to do with UTF-8 encoding. After long hours of wrestling with the problem, I finally found the cause: my Markdown file contains backslashes followed by text, which pandoc treats as LaTeX commands under its default settings. Once I knew this, I was able to fix the problem. More information can be found in this pandoc issue.

Someone suggested in that issue that this may be a problem with xelatex, because if we use
pandoc --pdf-engine=lualatex test.md -o test.pdf
the error message becomes something like this:

Error producing PDF.
! Undefined control sequence. l.416
…宽度有问题，应该把\textwidth换成

If the error message from the xelatex engine had been similar to the one above, I would have solved this problem long ago. So it appears that the misleading error message may indeed be related to xelatex.

However, suppose we split PDF generation into two steps: first generate a .tex file, then generate the PDF from it, like this:

pandoc -s -t latex -V CJKmainfont=KaiTi test.md -o test.tex # first step
xelatex test.tex # second step
Then the error message changes and looks just like the one from the lualatex engine. This suggests that the problem may not be related to xelatex, so we reach contradictory conclusions.

I am new to pandoc and do not know the internals of xelatex. Can anyone more knowledgeable point out what is causing the problem here: pandoc, xelatex, or both?

System and pandoc version info

I have tested the file on both Windows and Linux (CentOS 7). The exact versions of the system, pandoc, TeX Live and xelatex are listed below.

Windows

system version: Windows 8.1 32bit
Pandoc version: 2.0.5
TeX Live: 2016/W32TeX
xelatex: XeTeX 3.14159265-2.6-0.99996

Linux

system version: CentOS 7.2.1511
Pandoc version: 1.12.3.1
TeX Live: 2017
xelatex: 3.14159265-2.6-0.99998

Update 2017.12.29
With the release of Pandoc 2.0.6, this behaviour is handled more gracefully:

Allow lenient decoding of latex error logs, which are not always properly UTF8-encoded

Now it is easier to debug this kind of issue.
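To illustrate what "lenient decoding" means in general, here is a minimal Python sketch. Pandoc is written in Haskell, so this is only an analogy and not Pandoc's implementation. The log bytes are made up for illustration, with a single stray 0xBD byte like the one named in the error message above.

```python
# Hypothetical log excerpt: valid UTF-8 except for one stray byte (0xBD).
raw_log = (
    b"! Undefined control sequence.\nl.416 ...\xbd "
    + "应该把\\textwidth换成".encode("utf-8")
)

# Strict decoding rejects the whole log because of a single bad byte.
try:
    raw_log.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding substitutes U+FFFD for the bad byte and keeps the rest readable.
lenient = raw_log.decode("utf-8", errors="replace")

print(strict_ok)
print(lenient)
```

With lenient decoding the real TeX error (the undefined control sequence and its context) stays visible instead of being masked by a decoding failure.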

Best Answer

It is indeed true that XeTeX can produce invalid UTF-8 in its error output, and I can reproduce this with the following simpler .tex file:

\documentclass{article}
\begin{document}
应该把 123456789 123456789 123 \textwidth换成
\end{document}

So you can consider this either a bug in XeTeX (for producing invalid UTF-8) or in Pandoc (for incorrectly assuming that XeTeX will produce valid UTF-8).

Unicode and UTF-8

The problem, in short, is that you cannot just break a sequence of UTF-8 bytes in any arbitrary place. To take an example, in the string 应该把, the characters are:

U+5E94 CJK UNIFIED IDEOGRAPH-5E94, encoded in UTF-8 as E5 BA 94
U+8BE5 CJK UNIFIED IDEOGRAPH-8BE5, encoded in UTF-8 as E8 AF A5
U+628A CJK UNIFIED IDEOGRAPH-628A, encoded in UTF-8 as E6 8A 8A

So the string as a whole is encoded in UTF-8 as a sequence of 9 bytes:

E5 BA 94 E8 AF A5 E6 8A 8A
\______/ \______/ \______/
   应       该       把

You can break the byte sequence after 0, 3, 6, or 9 bytes to get a valid string containing 0, 1, 2 or 3 characters respectively. But breaking it at some other place results in invalid UTF-8.
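This is easy to check mechanically. The following Python sketch (the helper name `is_valid_utf8` is my own, purely for illustration) tries every prefix length of the 9-byte sequence and records which ones decode as valid UTF-8.

```python
text = "应该把"
data = text.encode("utf-8")  # 9 bytes: E5 BA 94 E8 AF A5 E6 8A 8A


def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if chunk decodes as strict UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Try cutting the byte sequence after 0, 1, ..., 9 bytes.
valid_prefix_lengths = [n for n in range(len(data) + 1) if is_valid_utf8(data[:n])]
print(valid_prefix_lengths)  # [0, 3, 6, 9]
```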

Unfortunately, that is exactly what XeTeX can do: it can break the byte sequence in some such place, resulting in invalid UTF-8 that Pandoc then fails to cope with (because it assumes valid UTF-8).

Explanation

In the first place, in Unicode-aware engines like XeTeX and LuaTeX, all Unicode characters can be part of control sequences. There happens to be no control sequence named \textwidth换成, so the system generates an error about an undefined control sequence.

Then, when printing this error to the terminal, TeX tries to add context around the place where the undefined control sequence \textwidth换成 was encountered, which means some additional characters surrounding the occurrence, enough to fill error_line characters. (This value can be increased; see here and here. Increasing it is a good idea anyway and reduces the likelihood of this error, but it can still happen with sufficiently long lines, as it does with the example in the question, because the maximum value of error_line is still only 254.)

Unfortunately (and this is the bug), it appears that XeTeX counts by bytes and truncates the output without regard for breaking only at well-defined Unicode code-point sequences. Look for procedure show_context in the XeTeX source code, and compare with print_valid_utf8 in the LuaTeX source code, used in its show_context.

In this example, XeTeX picks up only the last two bytes of the first word (the 8A 8A), which is not a valid UTF-8 sequence. That is why iconv and Pandoc complain.
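The difference between the two behaviours can be sketched in Python. This is not the actual XeTeX or LuaTeX code; the function names and the byte budget of 27 are hypothetical, chosen so that the cut lands inside 把 as in the example above.

```python
def tail_bytes_naive(data: bytes, limit: int) -> bytes:
    """Keep the last `limit` bytes, ignoring character boundaries."""
    return data[-limit:] if limit > 0 else b""


def tail_bytes_utf8_safe(data: bytes, limit: int) -> bytes:
    """Keep at most the last `limit` bytes, then drop any leading
    continuation bytes so the result starts on a character boundary."""
    tail = tail_bytes_naive(data, limit)
    start = 0
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while start < len(tail) and (tail[start] & 0xC0) == 0x80:
        start += 1
    return tail[start:]


line = "应该把 123456789 123456789 123 ".encode("utf-8")  # 34 bytes
limit = 27  # hypothetical budget: the cut falls after the E6 of 把

naive = tail_bytes_naive(line, limit)
safe = tail_bytes_utf8_safe(line, limit)

print(naive[:2].hex(" "))  # 8a 8a
print(repr(safe.decode("utf-8")))
```

The naive tail begins with the orphaned bytes 8A 8A and cannot be decoded, while the boundary-aware tail gives up two bytes of context and remains valid UTF-8.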

Demonstration

The commands I used for compiling the above .tex file with LuaTeX and XeTeX are respectively:

lualatex -interaction=nonstopmode test.tex | iconv -f UTF8

and

xelatex -interaction=nonstopmode test.tex | iconv -f UTF8

With the former (LuaTeX), I get the error message:

! Undefined control sequence.
l.3 ...把 123456789 123456789 123 \textwidth换成

but with the latter (XeTeX), I get an error message that is not valid UTF-8, so iconv fails with

iconv: (stdin):11:7: cannot convert

Without iconv, on my terminal I see printed:

! Undefined control sequence.
l.3 ...?? 123456789 123456789 123 \textwidth换成

and by redirecting the output to a file and viewing it in a raw editor, we can see better what's going on. The following is hexdump output from xxd -g 1 -c 32:

000001c0: 78 29 0a 21 20 55 6e 64 65 66 69 6e 65 64 20 63 6f 6e 74 72 6f 6c 20 73 65 71 75 65 6e 63 65 2e  x).! Undefined control sequence.
000001e0: 0a 6c 2e 33 20 2e 2e 2e 8a 8a 20 31 32 33 34 35 36 37 38 39 20 31 32 33 34 35 36 37 38 39 20 31  .l.3 ..... 123456789 123456789 1
00000200: 32 33 20 5c 74 65 78 74 77 69 64 74 68 e6 8d a2 e6 88 90 0a 20 20 20 20 20 20 20 20 20 20 20 20  23 \textwidth.......

Note the 8a 8a (the last two bytes of 把 = E6 8A 8A) just after the ellipsis (2e 2e 2e meaning ...).
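The same inspection can be scripted instead of reading a hexdump by eye. Below is a small Python sketch (the function name `invalid_utf8_spans` is my own) that reports the byte offsets of every invalid UTF-8 sequence; the sample bytes are copied from the start of the second hexdump row above.

```python
from typing import List, Tuple


def invalid_utf8_spans(data: bytes) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets of each invalid UTF-8 sequence."""
    spans = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            spans.append((pos + err.start, pos + err.end))
            pos += err.end
    return spans


# Bytes taken from the hexdump: "\nl.3 ..." followed by 8a 8a and " 123".
sample = bytes.fromhex("0a 6c 2e 33 20 2e 2e 2e 8a 8a 20 31 32 33")
spans = invalid_utf8_spans(sample)
print(spans)  # [(8, 9), (9, 10)]
```

To run it on real output, read the redirected log in binary mode (for example `open("out.log", "rb").read()`, where the file name is a placeholder) and pass the bytes to the function.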