Based on Ulrike's answer, here is one way to invoke xindy to get it to sort .idx files created by Xe/LuaLaTeX. The trick is to use xindy directly (instead of texindy) and pass the -C utf8 flag.
Minimal Example
\documentclass{article}
\usepackage{luatextra}
\usepackage{makeidx}
\makeindex
\begin{document}
üäö
start
\index{a}\index{b}\index{ä}\index{ü}
end
\printindex
\end{document}
Compilation
lualatex filename.tex
xindy -M texindy -C utf8 -L german-duden filename.idx
lualatex filename.tex
In (pdf)latex you can use UTF-8 encoding and xindy in the following way:
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{makeidx}
\makeindex
\begin{document}
start
\index{a}\index{b}\index{ä}\index{ü}
end
\printindex
\end{document}
And then simply run texindy -L ⟨language⟩ ⟨filename⟩.idx.
In LuaTeX you can also use the luainputenc package to use legacy encodings.
\documentclass{article}
\usepackage{fontspec}
\usepackage[utf8]{luainputenc}
\usepackage{makeidx}
\makeindex
\begin{document}
start
\index{a}\index{b}\index{ä}\index{ü}
end
\printindex
\end{document}
Again, run texindy -L ⟨language⟩ ⟨filename⟩.idx.
Here is the result for both examples:
First I want to define some symbols and functions for an easier formalization of
the answer.
Symbols:
- x: a lowercase hex digit: 0 to 9 or a to f
- N: not a lowercase hex digit
- c: a seven-bit letter with character code less than 128
- ^: a superscript character with catcode 7. The character code does not matter, but if used in a row the characters must have the same character code.
Functions:
- hextochar(str) returns one character whose character code is given by the string argument str interpreted as a hexadecimal number.
- charcode(chr) returns the character code of the character argument chr.
- numtochar(num) returns one character whose character code is given by the numerical argument num.
The given conversion rules for the engines are tried top to bottom until
the first rule can be applied.
TeX, eTeX, pdfTeX ("The TeXbook", "Chapter 8: The Characters You Type"):
- ^^xx ⇒ hextochar(xx)
- ^^c ⇒ if charcode(c) < 64 then numtochar(c+64) else numtochar(c-64)
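As a minimal sketch of these two rules, the following plain TeX file (the file itself is an illustration, not part of the original answer) writes a few converted characters to the terminal. It assumes the plain format, where ^ has catcode 7, and is meant for tex or pdftex.

```latex
% Plain TeX sketch; run with: tex file.tex (or pdftex file.tex)
\message{[^^41]}          % rule ^^xx: hex 41 is the letter A
\message{[^^7a]}          % rule ^^xx: hex 7a is the letter z
\message{[^^4A]}          % A is not a lowercase hex digit, so rule ^^c
                          % applies to 4 (code 52 < 64): 52+64 = 116, i.e. t,
                          % and the A is then read as an ordinary character
\message{[\number`\^^M]}  % rule ^^c: M has code 77, so 77-64 = 13
\message{[\number`\^^?]}  % rule ^^c: ? has code 63, so 63+64 = 127
\bye
```

The third line shows why the order of the rules matters: the ^^xx rule is tried first and only fails because the second character is uppercase.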
LuaTeX (function process_sup_mark in textoken.w):
- ^^^^^^xxxxxx ⇒ hextochar(xxxxxx)
  - Remark: xxxxxx is not limited, but values ≥ 0x110100 will cause trouble.
  - 0 to 0x10ffff: characters in the normal Unicode range.
  - 0x110000 to 0x1100ff: special characters that are shown as bytes (last 8 bits), they are not displayed in UTF-8.
- ^^^^xxxx ⇒ hextochar(xxxx)
- ^^xx ⇒ hextochar(xx)
- ^^c ⇒ if charcode(c) < 64 then numtochar(c+64) else numtochar(c-64)
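A short sketch of the LuaTeX forms, assuming the plain LuaTeX format (the test file is hypothetical and only illustrates the rules listed above). Each line prints the character code that the notation resolves to.

```latex
% Plain LuaTeX sketch; run with: luatex file.tex
\message{[\number`^^^^^^01f600]} % six carets, six hex digits: 0x01F600 = 128512
\message{[\number`^^^^00e4]}     % four carets, four hex digits: 0x00E4 = 228
\message{[\number`^^e4]}         % two carets, two hex digits: 0xE4 = 228
\bye
```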
XeTeX (xetex.ch, @<If this |sup_mark| starts an expanded character ...@>):
- ^^^^^^xxxxxx ⇒ hextochar(xxxxxx) only if xxxxxx ≤ 0x10ffff.
- ^^^^^xxxxx ⇒ hextochar(xxxxx)
- ^^^^xxxx ⇒ hextochar(xxxx)
- ^^^xxx ⇒ hextochar(xxx)
- ^^xx ⇒ hextochar(xx)
- ^^c ⇒ if charcode(c) < 64 then numtochar(c+64) else numtochar(c-64)
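The odd-length forms are specific to XeTeX. A small sketch, assuming the plain XeTeX format and the behaviour listed above (the test file is an illustration only):

```latex
% Plain XeTeX sketch; run with: xetex file.tex
\message{[\number`^^^0e4]}        % three carets, three hex digits: 0x0E4 = 228
\message{[\number`^^^^^1f600]}    % five carets, five hex digits: 0x1F600 = 128512
\message{[\number`^^^^^^10ffff]}  % six carets, upper limit: 0x10FFFF = 1114111
\bye
```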
However, XeTeX's implementation is not compatible with TeX. For example, if
the superscript character with catcode 7 is also a hexadecimal digit, then
XeTeX behaves unexpectedly:
\documentclass{minimal}
\begin{document}
\begingroup
\catcode`\4=7 % superscript
[$ 4444{a} $]
\endgroup
\end{document}
In the case of TeX/e-TeX/pdfTeX and LuaTeX, the two superscript characters 44 are
followed by the two hexadecimal digits 44. The result is the letter D (character
code 0x44, decimal 68), and the {a} that follows gives the variable "a" in math mode.
LuaTeX does not see four superscript characters, because they are not followed by
four hexadecimal digits.
XeTeX first sees four superscript characters. But they are not followed by
four hexadecimal digits. It switches to the case ^^c, where two superscript
characters are followed by a non-hexadecimal character. The result is "t" (0x74
= 0x34 ('4') + 64). The fourth "4" is then treated as a superscript that
raises the following {a}. But c is "4", a hexadecimal digit. XeTeX should have
applied case ^^xx. Therefore I consider this behaviour a bug.
(Edit: Correction for next paragraph, 65536 is correct and 256 was wrong — I had
looked at the wrong section in the web change file.)
Also the problem with \show is indeed a bug. The character is printed by calling the procedure print
with its character code as argument. If this code is less than biggest_char, then the character is printed; otherwise the code is interpreted
as a string id and the string with that id is printed instead (procedure print).
The definition of biggest_char:
@d biggest_char=65536 {the largest allowed character number;
must be |<=max_quarterword|}
Characters ≤ U+FFFF are shown correctly; the characters affected
are those beyond U+FFFF. This can be used for debugging
the string pool ⌣:
\catcode`\{=1
\catcode`\}=2
\catcode`\#=6
\catcode`\^=7
\catcode9=10
\def\msg#{\immediate\write16}
% #1: string index
% #2: string value
\def\StringPrintEntry#1#2{%
\msg{[#1: #2]}%
}
% #1: string index
% #2: message
\def\StringError#1#2{%
\msg{! [#1] #2}%
}
\def\StringPrintCount#1{%
\msg{==> #1 strings available.}%
}
% #1: start index (zero based, including)
% #2: number of entries
\def\StringList#1#2{%
\begingroup
\countdef\i=11 % string index
\countdef\m=12 % limit for string index
\countdef\u=13 % char index
\chardef\1=1 %
\i=#1\relax
\m=#2\relax
\advance\m\i
\u=\i
\advance\u "10000\relax
\StringProcess
\endgroup
}
\def\StringProcess{
\ifnum\i<\m
\ifnum\u<"110000 %
\lccode`0=\u
\lowercase{%
\edef\x{%
{\the\i}%
{\expandafter\StringStripPrefix\meaning 0}%
}%
}%
\expandafter\StringProcessEntry\x
\advance\i\1
\advance\u\1
\else
\let\StringProcess\relax
\expandafter\StringError\expandafter
{\the\i}%
{Index is too large.}%
\fi
\else
\let\StringProcess\relax
\fi
\StringProcess
}
\def\StringProcessEntry#1#2{%
\def\s{#2}%
\ifx\s\StringInvalid
\ifnum\i=5 % magic
\StringPrintEntry{#1}{#2}%
\else
\let\StringProcess\relax
\StringError{#1}{End of strings reached.}%
\fi
\else
\StringPrintEntry{#1}{#2}%
\fi
}
\def\StringStripPrefix#1 #2 {}
\def\StringInvalid{???}
\def\StringCount{%
\begingroup
\countdef\c=14 %
\c=0 %
\def\StringPrintEntry##1##2{%
\advance\c\1
\xdef\StringResult{\the\c}%
}%
\def\StringError##1##2{}%
\StringList{0}{"10FFFF}
\endgroup
\StringPrintCount{\StringResult}%
}
\StringList{0}{40}
\StringCount
\csname @@end\endcsname\end
The result:
This is XeTeX, Version 3.1415926-2.4-0.9998 (MiKTeX 2.9) (INITEX)
(C:\Users\one\test\test-xetex-strings.tex
[0: .4]
[1: .9998]
[2: buffer size]
[3: pool size]
[4: number of strings]
[5: ???]
[6: m2d5c2l5x2v5i]
[7: End of file on the terminal!]
[8: ! ]
[9: (That makes 100 errors; please try again.)]
[10: ? ]
[11: Type <return> to proceed, S to scroll future error messages,]
[12: R to run without stopping, Q to run quietly,]
[13: I to insert something, ]
[14: E to edit your file,]
[15: 1 or ... or 9 to ignore the next 1 to 9 tokens of input,]
[16: H for help, X to quit.]
[17: OK, entering ]
[18: batchmode]
[19: nonstopmode]
[20: scrollmode]
[21: ...]
[22: insert>]
[23: I have just deleted some text, as you asked.]
[24: You can now delete more, or insert, or whatever.]
[25: Sorry, I don't know how to help in this situation.]
[26: Maybe you should try asking a human?]
[27: Sorry, I already gave what help I could...]
[28: An error might have occurred before I noticed any problems.]
[29: ``If all else fails, read the instructions.'']
[30: (]
[31: Emergency stop]
[32: TeX capacity exceeded, sorry []
[33: If you really absolutely need more capacity,]
[34: you can ask a wizard to enlarge me.]
[35: This can't happen (]
[36: I'm broken. Please show this to someone who can fix can fix]
[37: I can't go on meeting you like this]
[38: One of your faux pas seems to have wounded me deeply...]
[39: in fact, I'm barely conscious. Please fix it and try again.]
==> 1381 strings available.
)
No pages of output.
Transcript written on test-xetex-strings.log.
Best Answer
The LuaTeX manual says quite clearly that \char now accepts values between 0 and 1114111 and extends this statement to the other similar commands like \lccode. As far as I know this is true in XeTeX too. And IMHO a command-line option to change this seems rather senseless.

But the eTeX extension can be disabled in both engines. Or more precisely: XeTeX has a command-line option to enable the eTeX extension (which is used by default by all TeX systems) and LuaTeX has a similar feature. So it is possible to build a format manually which doesn't use them. But this affects only the eTeX-relevant commands.
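
To illustrate the extended range, here is a minimal sketch, assuming a plain format under LuaTeX or XeTeX (the file is an illustration only). It assigns and reads back a lowercase code for the highest Unicode code point, 1114111 = "10FFFF.

```latex
% Plain TeX sketch for luatex or xetex
\lccode"10FFFF=`a               % set the \lccode of code point 0x10FFFF to that of a
\message{[\the\lccode"10FFFF]}  % read the value back
\bye
```

Under an 8-bit engine such as pdfTeX the same assignment would be expected to fail, since character codes there stop at 255.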