[Tex/LaTex] The ^^ notation in various engines

luatex pdftex xetex

Basically, how does the ^^^... notation work in LuaTeX and XeTeX, exactly?

In 8-bit TeX engines (recent TeX, eTeX, pdfTeX, at least), two consecutive identical catcode 7 characters (typically ^), followed by two lowercase hexadecimal digits, are converted before the tokenization step to the corresponding byte. Namely, ^^6f is exactly equivalent to o: for instance, \sh^^6fw ^^6f will cause TeX to show the letter o.

There is also the notation with two ^ (identical catcode 7 characters) followed by any ASCII character (but not by two lowercase hexadecimal digits), which is replaced by the character obtained by adding 64 to the character code if it is below 64 and subtracting 64 otherwise, so that the result stays among the ASCII characters (range 0 to 127).
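The add-or-subtract-64 arithmetic can be sketched in a few lines of Python. This is only a model of the rule as stated here, not engine code, and the function name caret_char is made up for the illustration.

```python
def caret_char(c: str) -> str:
    """Return the character denoted by ^^c for a single ASCII character c.

    Hypothetical helper: it models only the +/-64 rule described in the text.
    """
    code = ord(c)
    if code >= 128:
        raise ValueError("^^c is only defined for character codes below 128")
    # Codes below 64 move up by 64, all others move down by 64,
    # so the result always stays within 0..127.
    return chr(code + 64 if code < 64 else code - 64)


print(repr(caret_char("M")))  # '\r'   (77 - 64 = 13, carriage return)
print(repr(caret_char("J")))  # '\n'   (74 - 64 = 10, line feed)
print(repr(caret_char("?")))  # '\x7f' (63 + 64 = 127, delete)
print(repr(caret_char("@")))  # '\x00' (64 - 64 = 0)
```

This is why ^^M and ^^J are the usual ways to write the carriage return and line feed characters.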

Unicode-aware engines (I'm thinking of LuaTeX and XeTeX; there are perhaps other, less well-known ones around) also provide ^^^^xxxx and ^^^^^xxxxx for characters whose hexadecimal representation has 4 or 5 digits. But this does not seem to be done in the same way across engines.

For instance, both LuaTeX and XeTeX appear to accept the notation with 4, 5, or 6 carets followed by the same number of hexadecimal digits, but XeTeX also accepts it for 3, while LuaTeX doesn't. Compiling the following with pdfTeX, LuaTeX, and XeTeX gives different results.

\catcode0=12
\newlinechar=10
\def\loopshow#1{\message{\meaning #1^^J}\loopshow}
\loopshow \/
  ^^56
  ^^^056
  ^^^^0056
  ^^^^^00056
  ^^^^^^000056
  ^^^^^^^0000056
{\end\iffalse}
\fi}
\bye

One weird fact about XeTeX (a bug?) is that

\show ^^^^^^010101

shows the character displaywidth.

My goal (there may be a better way to do this) is to provide a way to test whether passing a given list of tokens through \scantokens is safe. For that, my plan is to go through the \detokenized token list one character at a time, applying TeX's rule for tokenizing (but no need to fully tokenize), and detecting begin-group and end-group tokens, as well as invalid characters.

Best Answer

First I want to define some symbols and functions for an easier formalization of the answer.

Symbols:

  • x: a lowercase hex digit: 0 to 9 or a to f
  • N: not a lowercase hex digit
  • c: a seven-bit character, that is, any character with character code less than 128
  • ^: a superscript character with catcode 7. The character code does not matter, but if used in a row the characters must have the same character code.

Functions:

  • hextochar(str) returns one character whose character code is given by the string argument str interpreted as hexadecimal number.
  • charcode(chr) returns the character code of the character argument chr.
  • numtochar(num) returns one character whose character code is given by the numerical argument num.

For each engine, the conversion rules are tried from top to bottom, and the first rule that applies is used.

TeX, eTeX, pdfTeX ("The TeXbook", "Chapter 8: The Characters You Type"):

  • ^^xx ⇒ hextochar(xx)
  • ^^c ⇒ if charcode(c) < 64 then numtochar(c+64) else numtochar(c-64)
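As a rough illustration of the "first matching rule wins" order for the 8-bit engines, here is a small Python sketch. It applies the two rules once at a given position of an input line and makes no attempt to model the rest of TeX's tokenizer. The name reduce_classic and the (character, characters consumed) return convention are assumptions made for this example.

```python
HEX = set("0123456789abcdef")  # only lowercase digits count as "x"


def reduce_classic(buf: str, pos: int = 0, sup: str = "^"):
    """Apply the two rules once at buf[pos]; return (char, chars_consumed).

    Hypothetical helper: sup is the catcode 7 character in use.
    """
    if buf[pos:pos + 2] != sup * 2 or pos + 2 >= len(buf):
        return buf[pos], 1  # no ^^ notation starts here
    pair = buf[pos + 2:pos + 4]
    # Rule 1: ^^xx with two lowercase hexadecimal digits.
    if len(pair) == 2 and all(d in HEX for d in pair):
        return chr(int(pair, 16)), 4
    # Rule 2: ^^c with a character code below 128.
    code = ord(buf[pos + 2])
    if code < 128:
        return chr(code + 64 if code < 64 else code - 64), 3
    return buf[pos], 1


print(reduce_classic("^^6f"))  # ('o', 4)
print(reduce_classic("^^6g"))  # ('v', 3): "6g" is not hex, so ^^6 is used
print(reduce_classic("^^6F"))  # ('v', 3): uppercase F is not a hex digit
```

The last two calls show why the order matters: as soon as the two characters after ^^ are not both lowercase hexadecimal digits, the ^^c rule consumes only the first of them.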

LuaTeX (function process_sup_mark in textoken.w):

  • ^^^^^^xxxxxx ⇒ hextochar(xxxxxx)

    • Remark: xxxxxx is not limited, but values ≥ 0x110100 will cause trouble.
    • 0 to 0x10ffff: characters in the normal Unicode range.
    • 0x110000 to 0x1100ff: special characters that are shown as bytes (last 8 bits), they are not displayed in UTF-8.
  • ^^^^xxxx ⇒ hextochar(xxxx)

  • ^^xx ⇒ hextochar(xx)

  • ^^c ⇒ if charcode(c) < 64 then numtochar(c+64) else numtochar(c-64)
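The three value ranges in the remark on the six-digit form can be written down as a small classifier. This is a Python sketch of the ranges as listed above only. The function name and the returned labels are invented for the illustration, and nothing here is taken from the LuaTeX sources.

```python
def luatex_value_kind(value: int):
    """Classify the number given by ^^^^^^xxxxxx according to the remark above.

    Hypothetical helper; returns a (label, payload) pair.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    if value <= 0x10FFFF:
        # Normal Unicode range.
        return ("unicode", chr(value))
    if value <= 0x1100FF:
        # Special characters shown as a single byte: keep the last 8 bits.
        return ("byte", value & 0xFF)
    # 0x110100 and above: reported above as causing trouble.
    return ("trouble", None)


print(luatex_value_kind(int("000056", 16)))  # ('unicode', 'V')
print(luatex_value_kind(int("1100e9", 16)))  # ('byte', 233)
print(luatex_value_kind(int("110100", 16)))  # ('trouble', None)
```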

XeTeX (xetex.ch, @<If this |sup_mark| starts an expanded character ...@>):

  • ^^^^^^xxxxxx ⇒ hextochar(xxxxxx)
    only if xxxxxx ≤ 0x10ffff.
  • ^^^^^xxxxx ⇒ hextochar(xxxxx)
  • ^^^^xxxx ⇒ hextochar(xxxx)
  • ^^^xxx ⇒ hextochar(xxx)
  • ^^xx ⇒ hextochar(xx)
  • ^^c ⇒ if charcode(c) < 64 then numtochar(c+64) else numtochar(c-64)
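To compare the three rule lists side by side, the following Python sketch encodes them as a table of caret counts and applies them literally, top to bottom, at one position. It returns character codes rather than characters so that values above 0x10ffff can be represented. The names RULES and reduce_once are assumptions for this example, and a rule that fails its range check simply falls through to the next one, which is one literal reading of the lists and not necessarily what a real engine does.

```python
HEX = set("0123456789abcdef")

# Caret counts per engine, in the order the lists above try them.
RULES = {
    "tex": (2,),
    "luatex": (6, 4, 2),
    "xetex": (6, 5, 4, 3, 2),
}


def reduce_once(engine: str, buf: str, pos: int = 0, sup: str = "^"):
    """Apply the listed rules once at buf[pos]; return (code, chars_consumed)."""
    for n in RULES[engine]:
        digits = buf[pos + n:pos + 2 * n]
        if (buf[pos:pos + n] == sup * n and len(digits) == n
                and all(d in HEX for d in digits)):
            value = int(digits, 16)
            if engine == "xetex" and value > 0x10FFFF:
                continue  # "only if xxxxxx <= 0x10ffff": try the next rule
            return value, 2 * n
    # Last rule in every list: ^^c with a character code below 128.
    if buf[pos:pos + 2] == sup * 2 and pos + 2 < len(buf):
        code = ord(buf[pos + 2])
        if code < 128:
            return (code + 64 if code < 64 else code - 64), 3
    return ord(buf[pos]), 1


for engine in RULES:
    print(engine, reduce_once(engine, "^^^056"))
# tex (30, 3)
# luatex (30, 3)
# xetex (86, 6)
```

With three carets, only the XeTeX list has a matching rule and yields 0x56. Under the other two lists the input falls through to ^^c with c being the third caret (94 - 64 = 30).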

However, XeTeX's implementation is not compatible with TeX. For example, if the superscript character with catcode 7 is itself a hexadecimal digit, XeTeX behaves unexpectedly:

\documentclass{minimal}
\begin{document}
\begingroup
  \catcode`\4=7 % superscript
  [$ 4444{a} $]
\endgroup
\end{document}

In the case of TeX/e-TeX/pdfTeX and LuaTeX, the two superscript characters 44 are followed by the two hexadecimal digits 44; the result is the letter D (character code 0x44, decimal 68), followed by {a}, which gives the variable "a" in math mode. LuaTeX does not see four superscript characters, because they are not followed by four hexadecimal digits.

XeTeX first sees four superscript characters. But they are not followed by four hexadecimal digits. It switches to the case ^^c, where two superscript characters are followed by a non-hexadecimal character. The result is "t" (0x74 = 0x34 ('4') + 64). The fourth "4" is then treated as a superscript that raises the following {a}. But c is "4", a hexadecimal digit, so XeTeX should have applied the case ^^xx. Therefore I consider this behaviour a bug.
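The difference between the two behaviours can be reproduced with a small Python model. The function xetex_as_described encodes one assumption inferred from the description above: the run of identical superscript characters (at most six) is counted first, and if it is not followed by the same number of hexadecimal digits, the ^^c case is used directly. It is a sketch of that description, not a transcription of xetex.ch, and all names are made up for the example.

```python
HEX = set("0123456789abcdef")


def flip64(c: str) -> str:
    code = ord(c)
    return chr(code + 64 if code < 64 else code - 64)


def by_the_rules(buf: str, pos: int, sup: str):
    """^^xx first, then ^^c, as listed for TeX, e-TeX and pdfTeX."""
    if buf[pos:pos + 2] == sup * 2 and pos + 2 < len(buf):
        pair = buf[pos + 2:pos + 4]
        if len(pair) == 2 and all(d in HEX for d in pair):
            return chr(int(pair, 16)), 4
        if ord(buf[pos + 2]) < 128:
            return flip64(buf[pos + 2]), 3
    return buf[pos], 1


def xetex_as_described(buf: str, pos: int, sup: str):
    """Assumed model: count the superscript run, then require that many digits."""
    run = 0
    while run < 6 and pos + run < len(buf) and buf[pos + run] == sup:
        run += 1
    if run >= 2:
        digits = buf[pos + run:pos + 2 * run]
        if len(digits) == run and all(d in HEX for d in digits):
            value = int(digits, 16)
            if value <= 0x10FFFF:
                return chr(value), 2 * run
            return buf[pos], 1  # assumption: out of range, leave it alone
        # Not enough hex digits for the whole run: go straight to ^^c.
        if pos + 2 < len(buf) and ord(buf[pos + 2]) < 128:
            return flip64(buf[pos + 2]), 3
    return buf[pos], 1


line = "4444{a}"  # "4" plays the role of the catcode 7 character
print(by_the_rules(line, 0, "4"))        # ('D', 4), rest of line: {a}
print(xetex_as_described(line, 0, "4"))  # ('t', 3), rest of line: 4{a}
```

In the second result one "4" is left over in front of {a}, which is the stray superscript described above.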

(Edit: Correction for the next paragraph: 65536 is correct and 256 was wrong; I had looked at the wrong section in the web change file.)

The problem with \show is indeed a bug as well. The character is printed by calling the procedure print with its character code as argument. If this code is less than biggest_char, the character itself is printed; otherwise the code is interpreted as a string id and the string with that id is printed instead. The definition of biggest_char:

@d biggest_char=65536 {the largest allowed character number;
   must be |<=max_quarterword|}
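The dispatch just described can be mimicked with a toy string pool in Python. This is only a sketch: show_char and the pool list are invented for the illustration, the five pool entries are copied from the output listing in this answer, and the offset 0x10000 between character code and pool index is an assumption taken from the \advance\u "10000 line of the macro in this answer.

```python
BIGGEST_CHAR = 65536  # value quoted from the change file above
POOL_BASE = 0x10000   # assumption: character code of pool entry 0

# First entries of the string pool, as printed in this answer's output.
pool = [".4", ".9998", "buffer size", "pool size", "number of strings"]


def show_char(code: int, pool: list) -> str:
    """Return what gets printed for a character with the given code."""
    if code < BIGGEST_CHAR:
        return chr(code)  # ordinary case: the character itself
    # Otherwise the code is taken as a string id and the string is printed.
    return pool[code - POOL_BASE]


print(show_char(0x56, pool))     # V
print(show_char(0x10002, pool))  # buffer size
```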

Characters ≤ U+FFFF are shown correctly; only the characters beyond that are affected. This can even be used for debugging the string pool:

\catcode`\{=1
\catcode`\}=2
\catcode`\#=6
\catcode`\^=7
\catcode9=10
\def\msg#{\immediate\write16}

% #1: string index
% #2: string value
\def\StringPrintEntry#1#2{%
  \msg{[#1: #2]}%
}
% #1: string index
% #2: message
\def\StringError#1#2{%
  \msg{! [#1] #2}%
}
\def\StringPrintCount#1{%
  \msg{==> #1 strings available.}%
}
% #1: start index (zero based, including)
% #2: number of entries
\def\StringList#1#2{%
  \begingroup
    \countdef\i=11 % string index
    \countdef\m=12 % limit for string index
    \countdef\u=13 % char index
    \chardef\1=1 %
    \i=#1\relax
    \m=#2\relax
    \advance\m\i
    \u=\i
    \advance\u "10000\relax
    \StringProcess
  \endgroup
}
\def\StringProcess{
  \ifnum\i<\m
    \ifnum\u<"110000 %
      \lccode`0=\u
      \lowercase{%
        \edef\x{%
          {\the\i}%
          {\expandafter\StringStripPrefix\meaning 0}%
        }%
      }%
      \expandafter\StringProcessEntry\x
      \advance\i\1
      \advance\u\1
    \else
      \let\StringProcess\relax
      \expandafter\StringError\expandafter
      {\the\i}%
      {Index is too large.}%
    \fi
  \else
    \let\StringProcess\relax
  \fi
  \StringProcess
}
\def\StringProcessEntry#1#2{%
  \def\s{#2}%
  \ifx\s\StringInvalid
    \ifnum\i=5 % magic
      \StringPrintEntry{#1}{#2}%
    \else
      \let\StringProcess\relax
      \StringError{#1}{End of strings reached.}%
    \fi
  \else
    \StringPrintEntry{#1}{#2}%
  \fi
}
\def\StringStripPrefix#1 #2 {}
\def\StringInvalid{???}
\def\StringCount{%
  \begingroup
    \countdef\c=14 %
    \c=0 %
    \def\StringPrintEntry##1##2{%
      \advance\c\1
      \xdef\StringResult{\the\c}%
    }%
    \def\StringError##1##2{}%
    \StringList{0}{"10FFFF}
  \endgroup
  \StringPrintCount{\StringResult}%
}

\StringList{0}{40}

\StringCount

\csname @@end\endcsname\end

The result:

This is XeTeX, Version 3.1415926-2.4-0.9998 (MiKTeX 2.9) (INITEX)
(C:\Users\one\test\test-xetex-strings.tex
[0: .4]
[1: .9998]
[2: buffer size]
[3: pool size]
[4: number of strings]
[5: ???]
[6: m2d5c2l5x2v5i]
[7: End of file on the terminal!]
[8: ! ]
[9: (That makes 100 errors; please try again.)]
[10: ? ]
[11: Type <return> to proceed, S to scroll future error messages,]
[12: R to run without stopping, Q to run quietly,]
[13: I to insert something, ]
[14: E to edit your file,]
[15: 1 or ... or 9 to ignore the next 1 to 9 tokens of input,]
[16: H for help, X to quit.]
[17: OK, entering ]
[18: batchmode]
[19: nonstopmode]
[20: scrollmode]
[21: ...]
[22: insert>]
[23: I have just deleted some text, as you asked.]
[24: You can now delete more, or insert, or whatever.]
[25: Sorry, I don't know how to help in this situation.]
[26: Maybe you should try asking a human?]
[27: Sorry, I already gave what help I could...]
[28: An error might have occurred before I noticed any problems.]
[29: ``If all else fails, read the instructions.'']
[30:  (]
[31: Emergency stop]
[32: TeX capacity exceeded, sorry []
[33: If you really absolutely need more capacity,]
[34: you can ask a wizard to enlarge me.]
[35: This can't happen (]
[36: I'm broken. Please show this to someone who can fix can fix]
[37: I can't go on meeting you like this]
[38: One of your faux pas seems to have wounded me deeply...]
[39: in fact, I'm barely conscious. Please fix it and try again.]
==> 1381 strings available.
 )
No pages of output.
Transcript written on test-xetex-strings.log.