I'm a cofounder at writeLaTeX.
We don't currently use a background daemon. Our backend runs pdflatex on Linux, so I can't say much about XeLaTeX on Windows (XeLaTeX support is planned), but here's our experience.
The main factor determining the compile time for a small document is whether the many source files for the packages it uses are already in the Linux disk cache. (See http://www.linuxatemyram.com for an overview of the disk cache and some good links.)
That is, the first LaTeX document you compile tends to be slow, because LaTeX has to read all of those files from their various locations on your hard disk. But when you compile a second document, the operating system has helpfully kept those files in main memory since it read them the first time, so reading them again is much faster.
I also know that jpallen of ShareLaTeX now maintains the CLSI, which is open source, so you can see how that backend works. I don't think it uses a background service, either.
As far as I can see, the daemon-based approaches in the links you provided still work in principle, but I don't know whether they're still supported.
So, first let me say that this is probably the most thoroughly researched question I've read so far. Congratulations.
I'll be reusing some of David's comments. The hashing algorithm takes the csname as an array of bytes (probably some differences there for XeTeX and LuaTeX; let me focus on 8-bit engines) and computes the sum of csname[i]*2^(len-i) over all i, modulo the hash prime, where csname[i] denotes the i-th byte in the csname and len is the total number of characters. The control sequence is then stored or retrieved at the resulting position in the hash table. If there was never any hash collision, then storage and retrieval would be easy and fast, since computing the hash is very fast (additions only). When there is a collision, TeX must check whether the command that it found in the hash table is indeed the csname that was searched for: this is done by comparing strings one character at a time, and is somewhat slower than the hashing algorithm (I think). TeX may need to compare all csnames with the same hash (it can stop once it finds the csname it was looking for).
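Spelled out (my notation, not taken verbatim from tex.web), that is

hash(csname) = ( csname[1]*2^(len-1) + csname[2]*2^(len-2) + ... + csname[len]*2^0 ) mod hash_prime

or, computed one byte at a time, hash <- (2*hash + csname[i]) mod hash_prime, i.e., repeated doubling and adding.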
When comparing strings which happen to be different but have the same hash, TeX goes through them one character at a time, and can stop comparing as soon as it finds a difference: there lies the difference between a long prefix and a long suffix. In scenarios where there are many hash collisions (e.g., with \number), most of the time is spent comparing strings which happen to have the same hash. If such strings differ on an early character (i.e., short common prefix), then they will be recognized as different pretty fast. If they have long common prefixes, then TeX will have to scan them for a long time (until the first difference) before it can declare them distinct.
Why is \number bad for hash collisions? Say that your array keys are 5-digit numbers. Compared to the hash of 00000, the hash of abcde (where a, etc., stand for digits) is larger by 16*a+8*b+4*c+2*d+e. For instance, 98765 has the same hash code as 00000 plus 16*9+8*8+4*7+2*6+5=253. This is tiny. In fact, the hash codes of all sets of five digits lie among only 280 possible values! A typical hash value corresponds to more than 300 csnames, and on average TeX will have to test half of them before finding the one it wants.
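To spell out where the 280 comes from: the offset 16*a+8*b+4*c+2*d+e is largest when all five digits are 9, namely

16*9 + 8*9 + 4*9 + 2*9 + 9 = 9*(16+8+4+2+1) = 9*31 = 279,

so the 10^5 possible five-digit keys share at most 280 hash values (offsets 0 through 279), roughly 100000/280, i.e., about 357 csnames per hash value.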
How do we spread out hashes? The main culprit in having only 280 different values with \number is that Knuth chose a multiplier equal to 2 only. This is justified when names are more or less random sequences of letters of varying length, as in typical TeX documents (adding one character multiplies the whole hash value by 2, so collisions between strings of different lengths are not particularly likely regardless of their internal structure). To choose my encoding, I decided to replace 16*a+8*b+4*c+2*d+e above by 16^4*a + 16^3*b + 16^2*c + 16*d + e. This is simply done by inserting three dummy characters between consecutive decimal digits, turning 98765 into 9---8---7---6---5. Why insert three rather than more? Because 16 is the smallest power of 2 greater than 10. Choosing 8 instead of 16 would still lead to hash collisions: the hashes of 0--0--0--0--0 and 9--9--9--9--9 are only 9*8^4+9*8^3+...+9 = 42129 apart, so a typical hash would correspond to two or three csnames.
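For comparison, the same worst-case spread with the factor of 16 is

9*(16^4 + 16^3 + 16^2 + 16 + 1) = 9*69905 = 629145,

comfortably more than the 10^5 possible keys, so collisions among the encoded keys become rare (the final reduction modulo the hash prime can, of course, still merge some of them).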
The last step was simply to convert this to code:
\def\step#1{#1---\step}
\def\endstep#1\step{}
\message{\step 123 \endstep}
Here I use a typical technique to optimize loops: define the looping macro to do what you want and call itself again (here we want #1 -> #1---), then figure out what to feed as #1 to break the loop. This avoids any test, hence is as fast as can be, while supporting an arbitrary length of input. Note also that I am putting dashes after the digit (#1--- instead of ---#1) in an effort to put relevant stuff early in the csname.
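For concreteness, here is a hypothetical way to plug this encoding into a csname-based array (the name myarray: is made up; \step and \endstep are as defined above):

% repeating the definitions from above so the snippet is self-contained
\def\step#1{#1---\step}
\def\endstep#1\step{}
% store: the csname built here is "myarray:9---8---7---6---5---"
\expandafter\def\csname myarray:\step 98765 \endstep\endcsname{some value}
% retrieve
\message{\csname myarray:\step 98765 \endstep\endcsname}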
For the specific application of inserting stuff within numbers, we might be able to do better (I haven't benchmarked): numbers will not have more than 10 digits (actually 6 might be a reasonable bound, so remove #6 through #9). The following should work and may be faster:
\def\dashes#1#2#3#4#5#6#7#8#9{#1---#2---#3---#4---#5---#6---#7---#8---#9---}
\message{\dashes 123********}
(at least 8 stars after the number, to support numbers of arbitrary lengths between 1 and 10).
Yet another option is to use \romannumeral on each digit, which reduces hash collisions because the lengths vary. I don't know how this fares.
One thing is certain: csnames of the form <prefix><varying><postfix> of a constant length can have 256*2^(len-1) different hashes (where len is the length of the varying part), and more realistically only 10*2^(len-1) hashes. Getting 10^5 distinct hashes absolutely requires 10 varying bytes, or some variation in the lengths of the prefix, varying, or postfix parts. In practical settings (where key encoding is taken into account, hence where it is difficult to get more than 10 different leading bytes), one needs about 15 bytes of varying material. My \step approach uses 20 (or 17, depending on whether you count the trailing ---). My \dashes approach may use various numbers depending on whether it is tailored to a specific number length.
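To make the "about 15 bytes" concrete: with at most 10 distinct leading bytes we need

10*2^(len-1) >= 10^5, i.e., 2^(len-1) >= 10^4,

and since 2^13 = 8192 < 10^4 <= 16384 = 2^14, the smallest solution is len-1 = 14, i.e., len = 15 varying bytes.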
Here are ways to implement arrays, depending on your use case. First, let us focus on arrays which can contain arbitrary tokens.
\toks registers. If you are defining and using your array in a controlled environment, use \toks registers within a group. This is limited to 32768 registers (65536 for LuaTeX, I think). Access is faster than csnames, and there is no need for tricks. In fact, this technique can also be used if your array is defined somewhere else in another form, but you need to access it a lot at a given place (within controlled code). This is (or has been) used in l3sort for merge sort.
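A minimal sketch of that controlled-environment use (register numbers chosen arbitrarily; the assignments are local to the group):

\begingroup
  % build a small scratch array in \toks registers
  \toks0={first item}%
  \toks1={second item}%
  \toks2={third item}%
  % access is direct and does not touch the hash table
  \message{item 1 holds: \the\toks1}%
\endgroup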
Multiple \csname approach. This is what you do, and has been discussed at length here. There is no limit on the number of slots, and keys may be other than numeric, but for numeric keys some trickery is needed to avoid hash collisions. Also, you cannot have many such arrays in memory at the same time lest you fill your hash table completely.
A single \csname containing everything. This is essentially what l3prop does at the time of writing. Mapping through all keys is straightforward and fast (per element). Accessing individual elements in an arbitrary order will be very slow, since the whole array needs to be expanded from a macro every time we want to access one element.
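A stripped-down sketch of the single-macro idea (this is not l3prop's actual storage format, just the principle, and the names are made up): keep all pairs in one macro and map by giving the item wrapper a meaning.

% all data lives in one macro, as \pair{<key>}{<value>} items
\def\myarray{\pair{1}{alpha}\pair{2}{beta}\pair{3}{gamma}}
% mapping: define \pair, then expand the whole array once
\def\pair#1#2{[#1=#2]}
\message{\myarray}% prints [1=alpha][2=beta][3=gamma]
% random access to a single key still forces expanding the whole list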
Hybrid approaches may be better if your access to given items happens in clusters (e.g., you want to look at keys 1–2000, then 1000–3000, then 2000–4000): you could allocate a certain range of \toks registers for this particular purpose and update once in a while which items are stored in this range. You could also store most stuff in a single csname and only send some items to individual csnames, or toks registers.
Another type of array holds numeric values, be they integers, dimensions, or the like.
Various registers can be used in the same way as \toks registers above. A typical example of this is l3regex, which at the time of writing abuses toks, dim, skip, and muskip registers extensively to store all sorts of information. Do not use integer registers, as stuff like \m@ne would break, but expect other stuff to break too (\maxdimen, ...). Using dimension or skip or muskip registers can get messy when there is a need to convert from one to the other: of course, we use stretch and shrink components too, which leads to \number\glueshrink\mutoglue\muskip3\relax-type constructions.
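To illustrate that last construction (a sketch assuming pdfTeX or LuaTeX, which provide \glueshrink and \mutoglue; the values are made up):

\muskip3=3mu plus 2mu minus 1mu
% extract the shrink component of the muskip, viewed as ordinary glue,
% as an integer number of scaled points
\message{\number\glueshrink\mutoglue\muskip3\relax}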
A more practical set of numbers is given by the \fontdimen parameters of a font. In effect, those behave like a usual C array whose size must be declared before starting to use it, and whose items are TeX dimens, i.e., integers less than 2^30 in magnitude. All assignments are global. Retrieval speed is comparable to other registers, perhaps a bit slower. See a question I asked on this topic for some other comments.
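A rough sketch of the idea (the font, size, and slot numbers are arbitrary; the parameter count can only be increased while the font is the most recently loaded one, and every assignment is global):

\font\datafont=cmr10 at 9.99pt % odd size, to get a fresh font rather than reuse an already-loaded cmr10
\fontdimen 100\datafont=0pt    % declare the array size: grow to 100 parameters while the font is freshly loaded
\fontdimen 42\datafont=12.5pt  % store (globally!)
\message{\the\fontdimen 42\datafont}% retrieve: prints 12.5pt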
Of course, the toks and csname approaches described earlier work fine for numeric data too, but they seem inefficient compared to \fontdimen, especially concerning memory usage (each \fontdimen parameter uses just a few bytes of memory, versus many bytes for the csname and many more for the tokens representing the value).
On a similar topic of optimizing data structures:
A long string of characters is stored more efficiently as a csname than as a list of tokens. This has been used at some point in l3fp, storing 10000 decimals of 1/(2pi) as \159154943091895..., to be passed to \string when needed.
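A toy version of that trick (the digits are truncated here; \escapechar=-1 makes \string return only the name, without a leading backslash):

{\escapechar=-1
 % the digit string is stored once, in the string pool, as a csname
 \message{\expandafter\string\csname 159154943091895\endcsname}}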
Trees which only contain numeric data and only need to be accessed non-expandably can be stored and manipulated as nested boxes, with data stored as various kern nodes, for instance. I played with this to implement sorting algorithms some time ago, but the code was messier than the current l3sort and not significantly faster. I'll look at it again some day. It might be possible to reuse this for the relationship between document elements.
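As a tiny illustration of the kern-node idea (not the actual code mentioned above): numbers go into a box as kerns and are read back, last one first, with \lastkern and \unkern.

\setbox0=\hbox{\kern3pt \kern5pt \kern8pt}% store 3, 5, 8 (as dimensions)
\setbox0=\hbox{\unhbox0 % unpack the stored kerns into the current list
  \message{last value: \the\lastkern}% prints 8.0pt
  \unkern % pop it
  \message{previous value: \the\lastkern}% prints 5.0pt
}%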
In the end, only one thing really makes a difference: single-threaded CPU power!
I do a lot of LaTeX compiling, especially of large beamer presentations (lecture notes with 500+ slides, lots of overlays and lots of TikZ stuff). A compilation with three pdflatex runs usually takes several minutes.
I have done experiments with this setting on a range of different machines at our lab, from my dual-core notebook, through ordinary quad-core PCs with 2 to 4 GB of RAM, up to a 48-core AMD server with 32 GB of RAM and ultra-fast SCSI disks.
I did not record the actual numbers from these tests. Roughly, though, the result is this: compilation times scale nearly linearly with CPU clock speed. The fastest Core i7 wins the race; memory and disk play only a minor role (provided both are of reasonable size and quality).
The results should not be surprising:
None of the existing TeX/LaTeX compilers uses thread-level parallelism, so none of them profits significantly from multiple CPU cores. The typical three pdflatex runs cannot be parallelized either; they have to be executed sequentially. In some cases it might, however, be possible to use process-level parallelism, so that multiple cores can speed up the overall compilation process.
I have not done any tests regarding cache size, but as the memory load is low, we cannot expect a significant benefit from extra-large caches.
To sum up: for a "LaTeX machine", it is the processor, and in particular its single-core throughput, that one should look for.
However, the latter is not always easy to figure out, as basically all processors nowadays are multi-core and are benchmarked and advertised by their multi-core throughput. A good starting point is the PassMark CPU Benchmarks site, which provides a frequently updated report on "Single Thread Performance" for all current IA32e/AMD64 CPUs.