What is a "correct" way to indent TeX documents?
I believe there is no "correct" way -- certainly no single correct way -- to indent input lines of an ordinary TeX or LaTeX document. (The situation may be different for input that adheres to LaTeX3 coding methods, but I'm not sure of that.) As long as TeX can parse the input, it's completely agnostic as to the form of the input it receives. Any indentation practices -- whether indenting with "soft tabs" or "hard" spaces, and whether the indentation amounts to 2, 4, or 8 spaces -- are therefore entirely cosmetic and voluntary in nature. To be sure, I don't believe anyone in their right mind would advocate not using some rule for indenting parts of one's TeX code.
A separate thought: in order to raise the readability of TeX input code, a technique that is probably as important as (or possibly even more important than) adopting some rule for indentation is using vertical whitespace liberally and consistently -- between paragraphs, above a sectioning command, before and after a \newenvironment statement, before and after a LaTeX float, etc.
So, first let me say that this is probably the most thoroughly researched question I've read so far. Congratulations.
I'll be reusing some of David's comments. The hashing algorithm takes the csname as an array of bytes (probably some differences there for XeTeX and LuaTeX; let me focus on 8-bit engines) and computes the sum of csname[i] * 2^(len-i), reduced modulo the hash prime, where csname[i] denotes the i-th byte of the csname and len is the total number of characters. The control sequence is then stored or retrieved at the resulting position in the hash table. If there were never any hash collisions, storage and retrieval would be easy and fast, since computing the hash is very fast (additions only). When there is a collision, TeX must check whether the command it found in the hash table is indeed the csname that was searched for: this is done by comparing strings one character at a time, and is somewhat slower than the hashing algorithm (I think). TeX may need to compare all csnames with the same hash (it can stop once it finds the csname it was looking for).
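As a rough illustration (my own sketch, not TeX's actual code), the hash of a csname like foo can be recomputed with ordinary TeX arithmetic; the prime 8501 is an assumption, since the actual hash prime depends on the engine and its configuration, and tex.web reduces modulo the prime at every step rather than once at the end:
\newcount\h \newcount\tmp
\h=102                             % byte code of `f'
\multiply\h by 2 \advance\h by 111 % h = 2h + byte code of `o'
\multiply\h by 2 \advance\h by 111 % h = 2h + byte code of `o'
\tmp=\h \divide\tmp by 8501 \multiply\tmp by 8501
\advance\h by -\tmp                % h = h mod 8501 (assumed hash prime)
\message{hash of `foo' = \the\h}   % prints 741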
When comparing strings which happen to be different but have the same hash, TeX goes through them one character at a time, and can stop comparing as soon as it finds a difference: there lies the difference between a long prefix and a long suffix. In scenarios where there are many hash collisions (e.g., with \number), most of the time is spent comparing strings which happen to have the same hash. If such strings differ on an early character (i.e., have a short common prefix), they will be recognized as different pretty fast. If they have long common prefixes, TeX will have to scan them for a long time (until the first difference) before it can declare them distinct.
Why is \number bad for hash collisions? Say that your array keys are 5-digit numbers. Compared to the hash of 00000, the hash of abcde (where a, etc., stand for digits) is 16*a + 8*b + 4*c + 2*d + e. For instance, 98765 has the same hash code as 00000 plus 16*9 + 8*8 + 4*7 + 2*6 + 5 = 253. This is tiny. In fact, the hash codes of all five-digit keys lie among only 280 possible values! A typical hash value corresponds to more than 300 csnames, and on average TeX will have to test half of them before finding the one it wants.
How do we spread out hashes? The main culprit in having only 280 different values with \number is that Knuth chose a multiplier of only 2. This is justified when names are more or less random sequences of letters of varying length, as in typical TeX documents (adding one character multiplies the whole hash value by 2, so collisions between strings of different lengths are not particularly likely, regardless of their internal structure). To choose my encoding, I decided to replace 16*a + 8*b + 4*c + 2*d + e above by 16^4*a + 16^3*b + 16^2*c + 16*d + e. This is done simply by inserting three dummy characters between consecutive decimal digits, turning 98765 into 9---8---7---6---5. Why insert three rather than more? Because 16 is the smallest power of 2 greater than 10. Choosing 8 instead of 16 would still lead to hash collisions: the hashes of 0--0--0--0--0 and 9--9--9--9--9 are only 9*8^4 + 9*8^3 + ... + 9 = 42129 apart, so a typical hash would correspond to two or three csnames.
The last step was simply to convert this to code:
\def\step#1{#1---\step}     % emit the digit and three dashes, then loop
\def\endstep#1\step{}       % delimited argument: removes the leftover ---\step
\message{\step 123 \endstep}% produces 1---2---3---
Here I use a typical technique to optimize loops: define the looping macro to do what you want and call itself again (here we want #1 -> #1---), then figure out what to feed as #1 to break the loop. This avoids any test, hence is as fast as can be, while supporting input of arbitrary length. Note also that I put the dashes after the digit (#1--- instead of ---#1) in an effort to put the relevant material early in the csname.
For the specific application of inserting stuff within numbers, we might be able to do better (I haven't benchmarked): numbers will not have more than 10 digits (actually, 6 might be a reasonable bound, in which case remove #6 through #9). The following should work and may be faster:
\def\dashes#1#2#3#4#5#6#7#8#9{#1---#2---#3---#4---#5---#6---#7---#8---#9---}
\message{\dashes 123********}
(Use at least 8 stars after the number, to support numbers of arbitrary length between 1 and 10.)
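For instance, storing and retrieving an item keyed by 98765 could look as follows (the array name arr is illustrative; the unconsumed stars simply become a constant postfix of the csname):
\expandafter\def\csname arr\dashes 98765********\endcsname{some value}
\message{\csname arr\dashes 98765********\endcsname}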
Yet another option is to use \romannumeral on each digit, which reduces hash collisions because the lengths vary. I don't know how this fares.
One thing is certain: csnames of the form <prefix><varying><postfix> of constant length can have 256*2^(len-1) different hashes (where len is the length of the varying part), and more realistically only 10*2^(len-1) hashes. Getting 10^5 distinct hashes absolutely requires 10 varying bytes, or some variation in the length of the prefix, varying, or postfix parts. In practical settings (where key encoding is taken into account, hence where it is difficult to get more than 10 different leading bytes), one needs about 15 bytes of varying material. My \step approach uses 20 (or 17, depending on whether you count the trailing ---). My \dashes approach may use various numbers, depending on whether it is tailored to a specific number length.
Here are ways to implement arrays, depending on your use case. First, let us focus on arrays which can contain arbitrary tokens.
\toks registers. If you are defining and using your array in a controlled environment, use \toks registers within a group. This is limited to 32768 registers (65536 for LuaTeX, I think). Access is faster than with csnames, and there is no need for tricks. In fact, this technique can also be used if your array is defined somewhere else in another form but you need to access it a lot at a given place (within controlled code); it is (or has been) used in l3sort for merge sort. (A short sketch of this and the next approach follows below.)
Multiple \csname approach. This is what you do, and it has been discussed at length here. There is no limit on the number of slots, and keys may be non-numeric, but for numeric keys some trickery is needed to avoid hash collisions. Also, you cannot have many such arrays in memory at the same time, lest you fill your hash table completely.
A single \csname containing everything. This is essentially what l3prop does at the time of writing. Mapping through all keys is straightforward and fast (per element). Accessing individual elements in an arbitrary order will be very slow, since the whole array needs to be expanded from a macro every time we want to access one element.
Hybrid approaches may be better if your accesses to given items happen in clusters (e.g., you want to look at keys 1-2000, then 1000-3000, then 2000-4000): you could allocate a certain range of \toks registers for this particular purpose and update once in a while which items are stored in that range. You could also store most stuff in a single csname and only send some items to individual csnames or toks registers.
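As a minimal sketch of the first two approaches (the register number, array name, and use of the dash encoding are illustrative):
% (1) \toks registers within a group; e-TeX provides 32768 of them:
\begingroup
\toks42={arbitrary balanced text}% store at index 42
\message{\the\toks42}%            retrieve
\endgroup
% (2) one csname per slot, with the dash encoding from above:
\expandafter\def\csname myarray 4---2\endcsname{arbitrary balanced text}
\message{\csname myarray 4---2\endcsname}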
Another type of array holds numeric values: integers, dimensions, and the like.
Various registers can be used in the same way as the \toks registers above. A typical example is l3regex, which at the time of writing abuses toks, dim, skip, and muskip registers extensively to store all sorts of information. Do not use integer registers, as things like \m@ne would break, and expect other things to break too (\maxdimen, ...). Using dimension, skip, or muskip registers can get messy when there is a need to convert from one to the other: of course, we use the stretch and shrink components too, which leads to constructions of the type \number\glueshrink\mutoglue\muskip3\relax.
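For instance, retrieving the shrink component of a muskip register as a plain integer might look like this (register number 3 is arbitrary; \mutoglue and \glueshrink are e-TeX extensions, and I am assuming \mutoglue maps mu to pt one-for-one):
\muskip3=2mu plus 1mu minus 3mu
\message{\number\glueshrink\mutoglue\muskip3\relax}% should print 196608, i.e., 3pt in sp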
A more practical set of numbers is given by the \fontdimen parameters of a font. In effect, those behave like a usual C array whose size must be declared before starting to use it, and whose items are TeX dimensions, i.e., integers less than 2^30 in magnitude. All assignments are global. Retrieval speed is comparable to that of other registers, perhaps a bit slower. See a question I asked on this topic for some other comments.
Of course, the toks and csname approaches described earlier work fine for numeric data too, but they seem inefficient compared to \fontdimen, especially in terms of memory usage (each \fontdimen parameter uses just a few bytes of memory, versus many bytes for a csname and many more for the tokens representing the value).
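A minimal sketch of the \fontdimen approach; the dummy font, its size, and the indices are illustrative, and the size must be declared right after loading, since new \fontdimen parameters can only be created for the most recently loaded font:
\font\arraydata=cmr10 at 1pt  % a dummy instance that just carries the data
\fontdimen100\arraydata=0pt   % declare the size: parameters 1 to 100 now exist
\fontdimen42\arraydata=3.25pt % store (all \fontdimen assignments are global)
\message{item 42 = \the\fontdimen42\arraydata}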
On a similar topic of optimizing data structures:
A long string of characters is stored more efficiently as a csname than as a list of tokens. This has been used at some point in l3fp, storing 10000 decimals of 1/(2pi) as \159154943091895..., to be passed to \string when needed.
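For instance, a minimal sketch (the digits are the beginning of 1/(2pi) quoted above; \escapechar=-1 makes \string yield just the characters of the name):
\begingroup
\escapechar=-1
\expandafter\def\csname 159154943091895\endcsname{}% the data lives in the name itself
\message{\expandafter\string\csname 159154943091895\endcsname}%
\endgroup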
Trees which only contain numeric data and only need to be accessed non-expandably can be stored and manipulated as nested boxes, with data stored as various kern nodes, for instance. I played with this to implement sorting algorithms some time ago, but the code was messier than the current l3sort and not significantly faster. I'll look at it again some day. It might be possible to reuse this for the relationship between document elements.
Best Answer
The alternative to TeX is Microsoft Word. Good luck with that.
Backwards compatibility of Word documents breaks after a few release cycles. LaTeX has been stable for 25+ years.
Word doesn't search for fonts either. That's an operating system issue.
It's not hard to make macros available. At the very least, all you have to do is put the package file in the same directory as the document file. Every distribution also comes with a default user location for macro files.
So...use the ones which do?
On my machine I double-click a TTF font and the OS installs it in my font library. XeLaTeX finds that font with zero problem.
There aren't macros in Word files like in TeX files, but Word doesn't auto-heal either: if a file uses fonts that aren't on the machine, it won't get them automatically.
The h command at a message provides the best possible guess as to the source of an error. But the same objection could be made of many languages. By "odd bugs introduced by escaping and quoting" do you mean in users' input files? That's not the fault of TeX, and changing the input file format is not going to fix PEBKAC errors. The TeX developers have fixed 9 bugs in 10 years.
Maybe this is a bit of a strawman argument since you didn't propose Word as a substitute, but I don't think your objections are unique to *TeX or warrant replacement.