How exactly does the ^^^... notation work in LuaTeX and XeTeX?
In 8-bit TeX engines (recent TeX, eTeX, pdfTeX, at least), two consecutive identical catcode 7 characters (typically ^), followed by two lowercase hexadecimal digits, are converted before the tokenization step to the corresponding byte. Namely, ^^6f is exactly equivalent to o: for instance, \sh^^6fw ^^6f will cause TeX to show the letter o.
There is also the notation with two ^ (identical catcode 7 characters), followed by any ASCII character (but not two lowercase hexadecimal digits), which is replaced by the character obtained by adding 64 to the character code if it is below 64 and subtracting 64 otherwise, so that the result stays among ASCII characters (range 0 to 127).
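To make the two 8-bit rules concrete, here is a small Python sketch. It is only a model of the arithmetic described above, not engine code, and the function names caret_shift and two_hex are made up for the illustration.

```python
HEX_DIGITS = "0123456789abcdef"  # only lowercase digits count


def two_hex(digits: str) -> str:
    """Model of ^^xx: two lowercase hex digits give the byte with that code."""
    if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
        raise ValueError("expected exactly two lowercase hex digits")
    return chr(int(digits, 16))


def caret_shift(c: str) -> str:
    """Model of ^^c: shift a 7-bit character code by 64, staying in 0..127."""
    code = ord(c)
    if code >= 128:
        raise ValueError("the ^^c form only applies to 7-bit characters")
    return chr(code + 64) if code < 64 else chr(code - 64)


if __name__ == "__main__":
    print(repr(two_hex("6f")))     # the o from ^^6f
    print(repr(caret_shift("M")))  # ^^M, the carriage return
    print(repr(caret_shift("?")))  # ^^?, the delete character
```

Because the shift is by 64 in either direction, applying caret_shift twice gives back the original character.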
Unicode-aware engines (I'm thinking of LuaTeX and XeTeX; there are perhaps other, less well-known ones around) also provide ^^^^xxxx and ^^^^^xxxxx for characters whose hexadecimal representation has 4 or 5 digits. But this does not seem to be done in the same way across engines.
For instance, both LuaTeX and XeTeX appear to accept the notation with 4, 5, or 6 carets followed by the same number of hexadecimal digits, but XeTeX also accepts it for 3, while LuaTeX doesn't. Compiling the following with pdfTeX, LuaTeX, and XeTeX gives different results.
\catcode0=12
\newlinechar=10
\def\loopshow#1{\message{\meaning #1^^J}\loopshow}
\loopshow \/
^^56
^^^056
^^^^0056
^^^^^00056
^^^^^^000056
^^^^^^^0000056
{\end\iffalse}
\fi}
\bye
One weird fact about XeTeX (a bug?) is that
\show ^^^^^^010101
shows the character displaywidth.
My goal (there may be a better way to do this) is to provide a way to test whether passing a given list of tokens through \scantokens is safe. For that, my plan is to go through the \detokenized token list one character at a time, applying TeX's rules for tokenizing (but with no need to fully tokenize), and detecting begin-group and end-group tokens, as well as invalid characters.
Best Answer
First I want to define some symbols and functions for an easier formalization of the answer.
Symbols:
x: a lowercase hex digit: 0 to 9 or a to f
N: not a lowercase hex digit
c: a seven-bit character with character code less than 128
^: a superscript character with catcode 7. The character code does not matter, but if used in a row the characters must have the same character code.
Functions:
hextochar(str) returns one character whose character code is given by the string argument str interpreted as a hexadecimal number.
charcode(chr) returns the character code of the character argument chr.
numtochar(num) returns one character whose character code is given by the numerical argument num.
The given conversion rules for the engines are tried top to bottom until the first rule can be applied.
TeX, eTeX, pdfTeX ("The TeXbook", "Chapter 8: The Characters You Type"):
^^xx ⇒ hextochar(xx)
^^c ⇒ if charcode(c) < 64 then numtochar(c+64) else numtochar(c-64)
LuaTeX (function process_sup_mark in textoken.w):
^^^^^^xxxxxx ⇒ hextochar(xxxxxx)
  xxxxxx is not limited, but values ≥ 0x110100 will cause trouble.
  Up to 0x10ffff: characters in the normal Unicode range.
  0x110000 to 0x1100ff: special characters that are shown as bytes (last 8 bits); they are not displayed in UTF-8.
^^^^xxxx ⇒ hextochar(xxxx)
^^xx ⇒ hextochar(xx)
^^c ⇒ if charcode(c) < 64 then numtochar(c+64) else numtochar(c-64)
XeTeX (xetex.ch, @<If this |sup_mark| starts an expanded character ...@>):
^^^^^^xxxxxx ⇒ hextochar(xxxxxx), only if xxxxxx ≤ 0x10ffff.
^^^^^xxxxx ⇒ hextochar(xxxxx)
^^^^xxxx ⇒ hextochar(xxxx)
^^^xxx ⇒ hextochar(xxx)
^^xx ⇒ hextochar(xx)
^^c ⇒ if charcode(c) < 64 then numtochar(c+64) else numtochar(c-64)
However, XeTeX's implementation is not compatible with TeX. For example, if the superscript character with catcode 7 is also a hexadecimal digit, then XeTeX behaves unexpectedly:
In the case of TeX/e-TeX/pdfTeX and LuaTeX, the two superscript characters 44 are followed by the two hexadecimal digits 44; the result is the letter D (character code 0x44, decimal 68), and the {a} that follows gives the variable "a" in math mode. LuaTeX does not see four superscript characters, because they are not followed by four hexadecimal digits.
XeTeX first sees four superscript characters. But they are not followed by four hexadecimal digits. It switches to the case ^^c, where two superscript characters are followed by a non-hexadecimal character. The result is "t" (0x74 = 0x34 ('4') + 64). The fourth "4" is then treated as a superscript that raises the following {a}. But c is "4", a hexadecimal digit. XeTeX should have applied the case ^^xx. Therefore I consider this behaviour a bug.
(Edit: Correction for the paragraph on \show below: 65536 is correct and 256 was wrong; I had looked at the wrong section in the web change file.)
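The rule lists and the 4444 case can be played through in a short Python model. This is a sketch under assumptions, not a transcription of any engine source: decode applies each engine's list exactly as written above, top to bottom, while decode_xetex_observed follows only the prose description of what XeTeX actually does (count the run of superscript characters, then fall back straight to ^^c). All names are invented for the illustration, end-of-line handling is ignored, and the input 4444{a} with 4 as the catcode 7 character is what the discussion above implies.

```python
HEX = set("0123456789abcdef")

# Numbers of superscript characters tried by each engine, in the listed order.
RULES = {
    "tex": (2,),              # TeX, e-TeX, pdfTeX
    "luatex": (6, 4, 2),
    "xetex": (6, 5, 4, 3, 2),
}


def caret_c(text, i, sup):
    """The ^^c rule: two sup characters, then a 7-bit character shifted by 64."""
    if text[i:i + 2] == sup * 2 and i + 2 < len(text):
        code = ord(text[i + 2])
        if code < 128:
            return (code + 64 if code < 64 else code - 64), 3
    return None


def decode(text, i, sup, engine):
    """Try the engine's rules top to bottom at position i.

    Returns (character code, characters consumed), or None if nothing applies.
    """
    for n in RULES[engine]:
        digits = text[i + n:i + 2 * n]
        if text[i:i + n] == sup * n and len(digits) == n and set(digits) <= HEX:
            value = int(digits, 16)
            if engine == "xetex" and value > 0x10FFFF:
                continue  # the XeTeX list restricts values to the Unicode range
            return value, 2 * n
    return caret_c(text, i, sup)


def decode_xetex_observed(text, i, sup):
    """XeTeX as described in the prose: count the run of sup characters
    (at most 6); if that many hex digits do not follow, go directly to ^^c."""
    n = 0
    while n < 6 and text[i + n:i + n + 1] == sup:
        n += 1
    if n < 2:
        return None
    digits = text[i + n:i + 2 * n]
    if len(digits) == n and set(digits) <= HEX:
        value = int(digits, 16)
        return (value, 2 * n) if value <= 0x10FFFF else None
    return caret_c(text, i, sup)


if __name__ == "__main__":
    for engine in RULES:
        print(engine, decode("4444{a}", 0, "4", engine))
    print("xetex (observed)", decode_xetex_observed("4444{a}", 0, "4"))
```

In this model, all three rule lists turn 4444 into 0x44 (D) and consume four characters, whereas decode_xetex_observed yields 0x74 (t) and consumes only three, leaving the fourth 4 to act as a superscript.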
Also the problem with \show is indeed a bug. The character is printed by calling the procedure print with its character code as argument. If this code is less than biggest_char, then the character is printed; otherwise the code is interpreted as a string id and the string with that id is printed instead (procedure print). The definition of biggest_char:
Characters ≤ U+FFFF are shown correctly; the characters beyond that are affected. This can be used for debugging the string pool ⌣:
The result: