It is indeed true that XeTeX can produce invalid UTF-8 in its error output, and I can reproduce this with the following simpler .tex file:
\documentclass{article}
\begin{document}
应该把 123456789 123456789 123 \textwidth换成
\end{document}
So you can consider this either a bug in XeTeX (for producing invalid UTF-8) or in Pandoc (for incorrectly assuming that XeTeX will produce valid UTF-8).
Unicode and UTF-8
The problem, in short, is that you cannot just break a sequence of UTF-8 bytes in any arbitrary place. To take an example, in the string 应该把, the characters are:
应 (U+5E94), encoded as E5 BA 94
该 (U+8BE5), encoded as E8 AF A5
把 (U+628A), encoded as E6 8A 8A
So the string as a whole is encoded in UTF-8 as a sequence of 9 bytes:
E5 BA 94 E8 AF A5 E6 8A 8A
\______/ \______/ \______/
   应       该       把
You can break the byte sequence after 0, 3, 6, or 9 bytes to get a valid string containing 0, 1, 2 or 3 characters respectively. But breaking it at some other place results in invalid UTF-8.
Unfortunately, that is exactly what XeTeX can do: it can break the byte sequence in some such place, resulting in invalid UTF-8 that Pandoc then fails to cope with (because it assumes valid UTF-8).
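To make the point about break positions concrete, here is a minimal Python sketch (not part of the original report; the helper name is_valid_utf8 is just for illustration) that tries every prefix of the 9-byte sequence and records which ones decode cleanly:

```python
# The three characters from the example, encoded as 9 bytes of UTF-8.
data = "应该把".encode("utf-8")  # E5 BA 94 E8 AF A5 E6 8A 8A


def is_valid_utf8(chunk: bytes) -> bool:
    """Return True if chunk decodes as strict UTF-8 (hypothetical helper)."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Try cutting the sequence after 0, 1, ..., 9 bytes.
valid_prefix_lengths = [n for n in range(len(data) + 1) if is_valid_utf8(data[:n])]
print(valid_prefix_lengths)  # [0, 3, 6, 9]

# The last two bytes on their own (8A 8A) are continuation bytes with no lead byte.
print(is_valid_utf8(data[-2:]))  # False
```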
Explanation
In the first place, in Unicode-aware engines like XeTeX and LuaTeX, all Unicode characters can be part of control sequences, and there happens to be no control sequence named \textwidth换成, so the system generates an error about an undefined control sequence.
Then, when printing this error to the terminal, TeX tries to add context around the place where the undefined control sequence \textwidth换成 was encountered, which means printing some of the surrounding characters, enough to fill error_line characters. (This can be increased; see here and here. Increasing it is a good idea anyway and makes this error less likely, but the error can still happen with sufficiently long lines (and does happen with the example in the question), because the maximum value of error_line is still only 254.)
Unfortunately (and this is the bug), it appears that XeTeX counts by bytes and truncates the output without making sure that the cut falls on a code-point boundary. Look for the procedure show_context in the XeTeX source code, and compare with print_valid_utf8 in the LuaTeX source code, which is used in LuaTeX's show_context.
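To illustrate the difference between the two behaviours, here is a rough Python sketch. It is not the code of either engine (I have not transcribed show_context or print_valid_utf8); the function name tail_on_char_boundary and the byte budget are assumptions made up for the illustration. It keeps the tail of a line within a byte budget, as the context before an error is the end of the text read so far, and then moves the cut forward past any continuation bytes (those of the form 10xxxxxx):

```python
def tail_on_char_boundary(data: bytes, max_bytes: int) -> bytes:
    """Keep at most max_bytes from the end of data without splitting a character.

    Hypothetical helper; assumes data itself is valid UTF-8.
    """
    start = max(0, len(data) - max_bytes)
    # UTF-8 continuation bytes look like 10xxxxxx. If the cut landed on one,
    # we are in the middle of a character, so skip ahead to the next lead byte.
    while start < len(data) and (data[start] & 0xC0) == 0x80:
        start += 1
    return data[start:]


line = "应该把 123".encode("utf-8")  # 9 bytes of CJK text + 4 ASCII bytes

naive = line[-6:]                        # plain byte count: starts with 8A 8A
careful = tail_on_char_boundary(line, 6)  # drops the orphaned continuation bytes

print(naive)    # b'\x8a\x8a 123'
print(careful)  # b' 123'
```

The careful version may return fewer bytes than the budget allows, which seems the right trade-off for a diagnostic message: a slightly shorter context is better than one that cannot be decoded.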
In this example, XeTeX picks up only the last two bytes of the first word (the 8A 8A that end its last character, 把), which is not a valid UTF-8 sequence. That is why iconv and Pandoc complain.
Demonstration
The commands I used for compiling the above .tex file with LuaTeX and XeTeX are respectively:
lualatex -interaction=nonstopmode test.tex | iconv -f UTF8
and
xelatex -interaction=nonstopmode test.tex | iconv -f UTF8
With the former (LuaTeX), I get the error message:
! Undefined control sequence.
l.3 ...把 123456789 123456789 123 \textwidth换成
but with the latter (XeTeX), I get an error message that is not valid UTF-8, so iconv fails with
iconv: (stdin):11:7: cannot convert
Without iconv, on my terminal I see printed:
! Undefined control sequence.
l.3 ...?? 123456789 123456789 123 \textwidth换成
and by redirecting the output to a file and viewing it in a raw editor, we can see better what's going on. The following is hexdump output from xxd -g 1 -c 32:
000001c0: 78 29 0a 21 20 55 6e 64 65 66 69 6e 65 64 20 63 6f 6e 74 72 6f 6c 20 73 65 71 75 65 6e 63 65 2e x).! Undefined control sequence.
000001e0: 0a 6c 2e 33 20 2e 2e 2e 8a 8a 20 31 32 33 34 35 36 37 38 39 20 31 32 33 34 35 36 37 38 39 20 31 .l.3 ..... 123456789 123456789 1
00000200: 32 33 20 5c 74 65 78 74 77 69 64 74 68 e6 8d a2 e6 88 90 0a 20 20 20 20 20 20 20 20 20 20 20 20 23 \textwidth.......
Note the 8a 8a (the last two bytes of 把 = E6 8A 8A) just after the ellipsis (2e 2e 2e, meaning ...).
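The same bytes can be checked outside of TeX. The following Python sketch (my own illustration, using a short excerpt of the bytes from the dump above) shows a strict decoder rejecting the sequence at the first 8a, while a lenient decoder substitutes one replacement character per stray byte, which matches the two ?? printed by the terminal:

```python
# Excerpt of the dump: "..." followed by the stray 8a 8a, then " 123".
raw = bytes.fromhex("2e 2e 2e 8a 8a 20 31 32 33")

# Strict decoding fails, as it does for iconv and Pandoc.
try:
    raw.decode("utf-8")
    error_offset = None
except UnicodeDecodeError as exc:
    error_offset = exc.start  # index of the first offending byte

print(error_offset)  # 3

# Lenient decoding replaces each invalid byte with U+FFFD.
lenient = raw.decode("utf-8", errors="replace")
print(lenient == "...\ufffd\ufffd 123")  # True
```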
Best Answer
Pandoc converts the file correctly. If you open the resulting file with a text editor you see the ⇒ symbol, which in turn means that pandoc has done a good job.
The point is that the text font you use (probably Latin Modern) does not contain the ⇒ character. If you change to a font that does contain the ⇒ symbol, e.g. iwona, it will appear as expected.
If you want to keep using Latin Modern as the body font, here's a small hack which takes the ⇒ symbol from the math font instead (it's a plain TeX solution; maybe LaTeX provides some nice abstraction around this): Place the following code into fixRightarrow.tex:
and call pandoc with the --include-before-body argument: