[Tex/LaTex] How to implement (low-level) arrays in TeX

data-structures, macros, performance, tex-core

Update: I have revised my question based on the comment by Bruno Le Floch and some further research.


As some may know, I'm implementing a huge TeX macro package which has to do a lot of internal computations, and hence will deliver much lower overall performance "per page" than, for instance, your usual LaTeX document.

Looking for quick wins in enhancing performance, I'm currently investigating the most performant ways to do some really basic computations in TeX.

One thing I use a lot is array-like structures in which large numbers of values are stored in main memory, accessible by one or several numeric keys.

So my question here is basically:

What is the best way to implement a one- or multi-dimensional array in TeX, concentrating on numerical access keys?

What I know so far

Currently, to store <value> in array <myarray> with key <key>, I basically do

\expandafter\def\csname<myarray>@<key>@DocScape\endcsname{<value>}

(assuming <key> is in a form which can go in a cs name).

Of course, this doesn't give a true random-access, compact array structure, but instead uses the internal hash table in the hope of getting good performance for inserting a new array element and "almost" O(1) access when retrieving array elements.
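
For a concrete array name, store and retrieve would then look roughly like this (a minimal sketch; the wrapper names are mine, purely illustrative):

% hypothetical wrappers for an array called "scores"
\def\scoresstore#1#2{\expandafter\def\csname scores@#1@DocScape\endcsname{#2}}
\def\scoresget#1{\csname scores@#1@DocScape\endcsname}
\scoresstore{17}{42}
\message{entry 17 holds: \scoresget{17}}   % prints 42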

Additionally, I can't imagine any other way of getting comparable performance with a TeX implementation.

Hence, for the moment I'm looking into ways of improving the performance of this basic technique.

Testing the raw hash performance

I've heard several times now that TeX's hashing algorithm for cs names is not so good, so one performance problem could be a large number of hash collisions, leading to performance much worse than O(1); but I haven't looked into the sources, so I don't know in what way exactly it is bad. Also, I haven't found any precise information on this subject on this site.

If hash performance is bad, I can imagine the following things which may influence the overall performance of the array construct:

  1. Total length of cs name
  2. Length of a "constant" prefix before the "significant part" (i.e. the encoding of <key>).
  3. The way <key> is encoded.

To test what form of constructing the cs name for array elements gives the best performance, I made the following test suite:

\input random

% Three different implementations of a "hash array" with numerical
% keys using different key encodings.

\input numhasharray.tex

\input hashnumhasharray.tex

\input codenumhasharray.tex

\newcount\iter
\newcount\rcount
\def\gobble#1{}



% A macro for setting up one experiment to test the "raw hash performance".

\def\mkhashexperiment#1#2#3#4#5%
{%
  \iter0

  \immediate\openout2=#1-setup.tex

  \immediate\write2{\noexpand\message{starting test "#1"; size #2; prefix "#3";}}
  \immediate\write2{\noexpand\message{postfix "#4"; coding \string\string\string#5}}%

  \loop
   \ifnum\iter<#2
    \advance\iter1
    \immediate\write2
    {%
      \noexpand\expandafter\noexpand\def
      \noexpand\csname#3#5\iter#4\noexpand\endcsname{\number\iter}%
    }%
  \repeat

  \immediate\closeout2 %

  \iter0

  \immediate\openout2=#1.tex

  \immediate\write2{\expandafter\gobble\string\\input\space#1-setup}%

  \immediate\write2{\expandafter\gobble\string\\newcount\string\mycount}%
  \loop
   \ifnum\iter<#20 % "#20" is parameter #2 with a 0 appended, i.e. ten times the array size
    \advance\iter1
    \setrannum{\rcount}{1}{#2}
    \immediate\write2
    {%
      % What is the most performant way to "just retrieve" the array
      % value without causing any further computation cost?
      \noexpand\mycount
      \noexpand\csname#3#5\rcount#4\noexpand\endcsname\relax
    }%
  \repeat

  \immediate\write2{\expandafter\gobble\string\\bye}%
  \immediate\closeout2 %

  \immediate\write1{time tex #1;}
}


% A macro for setting up one experiment to test the performance of an
% array implementation.
% Parameters: #1 = test name, #2 = array size, #3 = implementation file,
% #4 = store macro, #5 = retrieve macro.

\def\mkarrayexperiment#1#2#3#4#5%
{%
  \iter0

  \immediate\openout2=#1-setup.tex

  \immediate\write2{\noexpand\message{starting test "#1"; size #2;
  array implementation "#3"}}%

  \immediate\write2{\expandafter\gobble\string\\input\space#3}%

  \loop
   \ifnum\iter<#2
    \advance\iter1
    \immediate\write2
    {%
      \string#4{foo}{\number\iter}{\number\iter}%
    }%
  \repeat

  \immediate\closeout2 %

  \iter0

  \immediate\openout2=#1.tex

  \immediate\write2{\expandafter\gobble\string\\input\space#3}%

  \immediate\write2{\expandafter\gobble\string\\input\space#1-setup}%

  \immediate\write2{\expandafter\gobble\string\\newcount\string\mycount}%
  \loop
   \ifnum\iter<#20 % "#20" is parameter #2 with a 0 appended, i.e. ten times the array size
    \advance\iter1
    \setrannum{\rcount}{1}{#2}
    \immediate\write2
    {%
      % What is the most performant way to "just retrieve" the array
      % value without causing any further computation cost?
      \noexpand\mycount\string#5{foo}{\number\rcount}\relax
    }%
  \repeat

  \immediate\write2{\expandafter\gobble\string\\bye}%
  \immediate\closeout2 %

  \immediate\write1{time tex #1;}
}


% Number of array entries to generate; 10x this number of random retrievals
% is generated.

\def\experimentsize{100000}

% Execute this sh script to run the tests.

\immediate\openout1=testhash.sh

\immediate\write1{(}


% Testing the performance of the internal hash table.

% The usual way of implementing an array: Just put the access key in
% the name directly as a \number. 

\mkhashexperiment{testhash1}{\experimentsize}{pre}{}{\number}
\mkhashexperiment{testhash2}{\experimentsize}{}{post}{\number}
\mkhashexperiment{testhash3}{\experimentsize}{verylongprefixtostresshashtable}{}{\number}
\mkhashexperiment{testhash4}{\experimentsize}{}{verylongprefixtostresshashtable}{\number}
\mkhashexperiment{testhash5}{\experimentsize}{verylong}{prefixtostresshashtable}{\number}
\mkhashexperiment{testhash6}{\experimentsize}{verylongprefixto}{stresshashtable}{\number}


% Encoding provided by Bruno Le Floch to optimise number hashing.

\mkhashexperiment{testhash19}{\experimentsize}{pre}{}{\hashnumber}
\mkhashexperiment{testhash20}{\experimentsize}{}{post}{\hashnumber}
\mkhashexperiment{testhash21}{\experimentsize}{verylongprefixtostresshashtable}{}{\hashnumber}
\mkhashexperiment{testhash22}{\experimentsize}{}{verylongprefixtostresshashtable}{\hashnumber}
\mkhashexperiment{testhash23}{\experimentsize}{verylong}{prefixtostresshashtable}{\hashnumber}
\mkhashexperiment{testhash24}{\experimentsize}{verylongprefixto}{stresshashtable}{\hashnumber}


% Best performance I found so far.

\mkhashexperiment{testhash25}{\experimentsize}{pre}{}{\mynumcode}
\mkhashexperiment{testhash26}{\experimentsize}{}{post}{\mynumcode}
\mkhashexperiment{testhash27}{\experimentsize}{verylongprefixtostresshashtable}{}{\mynumcode}
\mkhashexperiment{testhash28}{\experimentsize}{}{verylongprefixtostresshashtable}{\mynumcode}
\mkhashexperiment{testhash29}{\experimentsize}{verylong}{prefixtostresshashtable}{\mynumcode}
\mkhashexperiment{testhash30}{\experimentsize}{verylongprefixto}{stresshashtable}{\mynumcode}


% Testing the performance of different array implementations.

% Hash array with simple numerical keys.

\mkarrayexperiment{testnumhasharray}{\experimentsize}{numhasharray}{\numhashstore}{\numhashretrieve}

% Hash array with encoding provided by Bruno Le Floch to optimise
% number hashing. 

\mkarrayexperiment{testhashnumhasharray}{\experimentsize}{hashnumhasharray}{\hashnumhashstore}{\hashnumhashretrieve}

% Hash array with "hash spread" encoding.

\mkarrayexperiment{testcodenumhasharray}{\experimentsize}{codenumhasharray}{\codenumhashstore}{\codenumhashretrieve}


\immediate\write1{) \string&> testhash.log}

\immediate\closeout1 %

\bye

To generate the tests, you need the following implementation files for different hash-based array variants:

numhasharray.tex

% The most basic "hash array" for numeric keys: 
% just use the number as a key.

\def\numhashstore#1#2#3{\expandafter\def\csname\number#2\string_#1\string_nh\endcsname{#3}}
\def\numhashretrieve#1#2{\csname\number#2\string_#1\string_nh\endcsname}

hashnumhasharray.tex

% This encoding is by Bruno Le Floch, directly constructed to optimise
% hash performance.

\def\step#1{#1---\step}
\def\endstep#1\step{}
\def\hashnumber#1{\expandafter\step \number#1 \endstep}


\def\hashnumhashstore#1#2#3{\expandafter\def\csname\hashnumber{#2}\string_#1\string_hnh\endcsname{#3}}
\def\hashnumhashretrieve#1#2{\csname\hashnumber{#2}\string_#1\string_hnh\endcsname}
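
For illustration, a key like 98765 should come out of this encoding as 9---8---7---6---5--- (my own trace of the expansion, so treat the exact output as illustrative):

\message{\hashnumber{98765}}   % should print: 9---8---7---6---5---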

codenumhasharray.tex

% My own encoding for numerical keys, hoping to spread out hash
% codes. In particular, I'm trying to get different cs name
% lengths. I'm indebted to Bruno Le Floch for the neat way of ending
% the recursion without an \if construct.

\expandafter\def\csname numkeya0\endcsname{a1Y@}
\expandafter\def\csname numkeya1\endcsname{b}
\expandafter\def\csname numkeya2\endcsname{c2}
\expandafter\def\csname numkeya3\endcsname{dZ}
\expandafter\def\csname numkeya4\endcsname{e3'}
\expandafter\def\csname numkeya5\endcsname{f}
\expandafter\def\csname numkeya6\endcsname{g4!}
\expandafter\def\csname numkeya7\endcsname{h}
\expandafter\def\csname numkeya8\endcsname{i5-}
\expandafter\def\csname numkeya9\endcsname{j"}
\expandafter\def\csname numkeya;\endcsname\myencodeb{}

\expandafter\def\csname numkeyb0\endcsname{k6}
\expandafter\def\csname numkeyb1\endcsname{l;}
\expandafter\def\csname numkeyb2\endcsname{m7/}
\expandafter\def\csname numkeyb3\endcsname{n}
\expandafter\def\csname numkeyb4\endcsname{o8}
\expandafter\def\csname numkeyb5\endcsname{p(:}
\expandafter\def\csname numkeyb6\endcsname{q9}
\expandafter\def\csname numkeyb7\endcsname{r}
\expandafter\def\csname numkeyb8\endcsname{s0(}
\expandafter\def\csname numkeyb9\endcsname{t,}
\expandafter\def\csname numkeyb;\endcsname\myencodec{}

\expandafter\def\csname numkeyc0\endcsname{uO}
\expandafter\def\csname numkeyc1\endcsname{v)}
\expandafter\def\csname numkeyc2\endcsname{wP}
\expandafter\def\csname numkeyc3\endcsname{x<}
\expandafter\def\csname numkeyc4\endcsname{yQ=}
\expandafter\def\csname numkeyc5\endcsname{z}
\expandafter\def\csname numkeyc6\endcsname{AR}
\expandafter\def\csname numkeyc7\endcsname{B?>}
\expandafter\def\csname numkeyc8\endcsname{CS}
\expandafter\def\csname numkeyc9\endcsname{D}
\expandafter\def\csname numkeyc;\endcsname\myencoded{}

\expandafter\def\csname numkeyd0\endcsname{ET[|}
\expandafter\def\csname numkeyd1\endcsname{F}
\expandafter\def\csname numkeyd2\endcsname{GU}
\expandafter\def\csname numkeyd3\endcsname{H]}
\expandafter\def\csname numkeyd4\endcsname{IV}
\expandafter\def\csname numkeyd5\endcsname{J}
\expandafter\def\csname numkeyd6\endcsname{KW*}
\expandafter\def\csname numkeyd7\endcsname{L}
\expandafter\def\csname numkeyd8\endcsname{MX}
\expandafter\def\csname numkeyd9\endcsname{N+}
\expandafter\def\csname numkeyd;\endcsname\myencodea{}

\def\mynumcode#1{\expandafter\myencodea\number#1;}
\def\myencodea#1{\csname numkeya#1\endcsname\myencodeb}
\def\myencodeb#1{\csname numkeyb#1\endcsname\myencodec}
\def\myencodec#1{\csname numkeyc#1\endcsname\myencoded}
\def\myencoded#1{\csname numkeyd#1\endcsname\myencodea}

\def\codenumhashstore#1#2#3{\expandafter\def\csname\mynumcode{#2}\string_#1\string_cnh\endcsname{#3}}
\def\codenumhashretrieve#1#2{\csname\mynumcode{#2}\string_#1\string_cnh\endcsname}
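
For illustration, here is what this encoding should produce for one key, traced by hand (so treat the exact string as illustrative):

% \mynumcode{98765} -> \myencodea 98765;
%   -> j"  (numkeya9)  s0(  (numkeyb8)  B?>  (numkeyc7)  KW*  (numkeyd6)  f  (numkeya5)
% i.e. the key part of the cs name becomes: j"s0(B?>KW*f
\message{\mynumcode{98765}}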

This will generate a number of TeX files and a shell script to run a set of tests comparing the performance of different cs name constructions.

\mkhashexperiment{testhash1}{100000}{pre}{post}{\number}

generates a TeX file testhash1.tex creating an array construct with 100000 successively numbered entries and 1000000 random retrievals, where the cs name is

\csname pre\number<key>post\endcsname

for <key>s between 1 and 100000.
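
In other words, testhash1-setup.tex consists (apart from the initial \message) of 100000 lines like the first one below, and testhash1.tex of 1000000 lines like the second, after its \input and \newcount preamble (the key values here are just illustrative):

\expandafter\def\csname pre1post\endcsname{1}
\mycount\csname pre43857post\endcsname\relax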

Comparing the running times of tests testhash1 to testhash6 above, I get the following on my laptop (quoting from testhash.log):

starting test "testhash1"; size 100000; prefix "pre";
postfix ""; coding \number) )
real    0m4.642s
user    0m4.296s
sys 0m0.068s

starting test "testhash2"; size 100000; prefix "";
postfix "post"; coding \number) )
real    0m3.827s
user    0m3.748s
sys 0m0.036s

starting test "testhash3"; size 100000; prefix "verylongprefixtostresshashtable
"; postfix ""; coding \number) )
real    0m16.614s
user    0m16.265s
sys 0m0.160s

starting test "testhash4"; size 100000; prefix "";
postfix "verylongprefixtostresshashtable"; coding \number) )
real    0m6.317s
user    0m6.176s
sys 0m0.072s

starting test "testhash5"; size 100000; prefix "verylong";
postfix "prefixtostresshashtable"; coding \number) )
real    0m8.971s
user    0m8.789s
sys 0m0.116s

starting test "testhash6"; size 100000; prefix "verylongprefixto";
postfix "stresshashtable"; coding \number) )
real    0m13.337s
user    0m12.753s
sys 0m0.112s

(Note that these are simple system timing results on my laptop computer, so they are probably slightly skewed by other processes on the system. Still the numbers are basically reproducible on repetition.)

The first (maybe surprising) result is that this amount of variation alone creates a huge difference in overall performance, up to a factor of 4 in running time. If it were possible to discount the fixed amount of additional processing (I used a counter assignment to access each array value), the difference would probably be much larger still!

Now I heard that hash performance is particularly bad for sequences of digits, so \number is possibly not a good idea for encoding the array access key.

Bruno Le Floch gave a comment in which he describes how to optimize number representation for hashing. This is implemented in the macro \hashnumber and tested by the test cases testhash19 to testhash24. The results:

starting test "testhash19"; size 100000; prefix "pre";
postfix ""; coding \hashnumber) )
real    0m2.395s
user    0m2.280s
sys 0m0.052s

starting test "testhash20"; size 100000; prefix "";
postfix "post"; coding \hashnumber) )
real    0m2.483s
user    0m2.260s
sys 0m0.048s

starting test "testhash21"; size 100000; prefix "verylongprefixtostresshashtabl
e"; postfix ""; coding \hashnumber) )
real    0m4.173s
user    0m3.740s
sys 0m0.056s

[snip]

This gives a significantly better result than when using \number:

  • The best timing is almost twice as fast as with \number (and thus roughly 8 times faster than the worst \number case).
  • The "spread" of timings between different cs name constructions is almost negligible; even the worst time for \hashnumber is better than the best time for \number.

Of course, I'm intrigued by how far this could still be enhanced by choosing a more clever encoding, but I have no clue how to go about this systematically.

In the hope to "spread out" hash keys for encoded numbers a bit more I made an encoding function myself (\mynumcode; see test suite above). It will use a lot of different characters, produce different key lengths and avoid repetitions, which hopefully leads to a better distribution of hash keys.

Using this encoding, I get the following running times:

starting test "testhash25"; size 100000; prefix "pre";
postfix ""; coding \mynumcode) )
real    0m2.194s
user    0m1.920s
sys 0m0.016s

starting test "testhash26"; size 100000; prefix "";
postfix "post"; coding \mynumcode) )
real    0m2.089s
user    0m2.020s
sys 0m0.040s

starting test "testhash27"; size 100000; prefix "verylongprefixtostresshashtabl
e"; postfix ""; coding \mynumcode) )
real    0m3.594s
user    0m3.008s
sys 0m0.036s

[snip]

You see that the timing results are slightly better than for \hashnumber, but not significantly so. So probably this is getting near optimal (for the hashing implementation of the TeX engine).

As the tests reported above are as near to testing the "raw performance" of the hash mechanism as conceivably possible (just storing and retrieving values; loading no additional macros and avoiding all computation), I dare to conclude the following:

  1. On the whole, storing array elements as separate macros with constructed cs names is feasible. Compared to the usual runtimes of TeX documents, the best given "raw" times for storing 100000 array elements and retrieving random elements for 1000000 times are OK and are probably completely dwarfed by any way in which these 1000000 values are processed further by the TeX system in a real application.
  2. Indeed, using \number to encode a numeric key is not optimal for array performance.
  3. The solution given by Bruno Le Floch for achieving a "hash-optimised" number encoding assures a significantly better hash performance at small computational cost (see also below).
  4. The total length of the cs name doesn't seem to be significant.
  5. When using \number, the overall performance of the array implementation is, inexplicably, almost directly inversely proportional to the length of the "static" prefix in front of the array access key (much less so for any other number encoding).

Comparison of engines

The test results cited above are with the original TeX engine. I also ran the test suite shown above with other engines.

  • pdftex gives slightly lower performance overall, but the difference is really insignificant.
  • xetex also gives slightly lower performance overall, and here the difference looks a little more significant (up to 1.5 times slower).

With luatex, the performance differences are a bit more subtle: cases which are fast with TeX are sometimes significantly slower (up to 2 times slower), while cases where TeX is slow become faster, so it seems LuaTeX deals better with those cases (long static prefix) where the original engine is bad.

Does everything get slower?

One more concern about using the internal hash table to store arrays as thousands of cs names is that it will slow down all processing due to

  1. the fact that the size of the hash table has to be extended for really huge amounts of data, so looking up any cs name will take more time;
  2. the fact that the generated cs names "clutter up" the hash table, leading to hash conflicts for regular macro names as well.

I tested this by generating the above test suite with \def\experimentsize{1000000} and using the following test file:

\documentclass{article}

%\input{testhash1-setup}
%\input{testhash25-setup}

\usepackage{pgfplots}


\let\rmdefault\sfdefault

\begin{document}

\newcommand\test[1]
{%
[can't repeat the example code here because of post length limit]
}

\test{1}

\test{2}

\test{3}

\test{4}

\test{5}

\test{6}

\end{document}

The XKCD plot example is taken from this answer. The intention behind using a TikZ example is that TikZ loads a really large amount of source code, defining and using heaps of macros, so if anything is wrong with hash performance for normal macro processing, this is one application where it should be most observable.

Note that the \input statements (commented out above) are executed before loading pgfplots, to maximise potential hash conflicts when loading TikZ.

I had to extend the internal tables as follows to compile the example file with one of the \input statements uncommented:

export hash_extra=1500000
export max_strings=2000000
export pool_size=30000000

First, let's look at the effect of the larger hash table. If I compile the example (with both \input statements commented out) using time pdflatex and without any table extensions, I get the following result:

real    0m3.686s
user    0m3.628s
sys 0m0.044s

With the extensions mentioned above in place, I get

real    0m3.746s
user    0m3.684s
sys 0m0.052s

I would say this is a noticeable (also reproducible) slowdown, but totally insignificant for practical purposes.

Now, the timing result with \input{testhash1-setup} uncommented:

real    0m19.760s
user    0m19.585s
sys 0m0.140s

and with \input{testhash25-setup} uncommented:

real    0m10.710s
user    0m10.585s
sys 0m0.104s

And here are the timing results for "only" processing testhash1-setup with the following test file:

\documentclass{article}

\input{testhash1-setup}

\begin{document}
\end{document}

(Note there will be a small overhead due to starting up LaTeX twice.)

real    0m15.401s
user    0m15.297s
sys 0m0.084s

And for testhash25-setup:

real    0m4.218s
user    0m4.104s
sys 0m0.104s

Summing the run times of the separate computations and comparing with the joint computation, I get

  • For testhash1-setup, the sum of both "separate" computation times (usr timing) is 18.981s, while the joint computation takes 19.585s.
  • For testhash25-setup, the sum of both "separate" computation times is 7.788s, while the joint computation takes 10.585s.

Interestingly, this means that when the array key is encoded with \number, the subsequent computation is not slowed down significantly, while with my faster encoding method the slowdown is on the order of the original running time of the TikZ example!

So this example actually allows us to observe the effect of hash table cluttering: as my encoding tries to spread the hash codes for array elements across the whole hash table, the number of hash conflicts for regular macro names increases, and indeed the performance is halved.

The results are comparable for the encoding by Bruno Le Floch.

Note, however, that the experimental setting is not very realistic, because loading the array code before TikZ amounts to setting up data structures before defining the actual document macros, while usually it is the other way round: first, macro packages and the like are loaded, and only later, during document processing, is data stored in arrays which are defined on the fly.

So what happens when I move the \input{testhash25-setup} after the \usepackage{pgfplots}?

real    0m8.928s
user    0m8.809s
sys 0m0.100s

So the slowdown is still noticeable, but much less than when executing the other way round.

Hence, my conclusions from these experiments are:

  1. Extending the internal hash table alone doesn't slow TeX down appreciably.
  2. Cluttering up the hash table with macros for array elements does slow TeX down noticeably.
  3. It makes a difference when macros are defined; the basic packages should come first and the definition of array elements should come as late as possible.
  4. When deciding how to encode array indices, all effects should be taken into account. It makes no sense to trade slightly better hash performance for vastly slower overall run time.

We haven't even talked about arrays yet…

Indeed, so far all tests were at the "raw hash" level, just executing the pre-expanded store and retrieval commands. In a realistic application setting, of course the encoding of array keys has to take place at execution time.

That's the purpose of the macro \mkarrayexperiment in the test suite above: it will generate a test file containing statements like

\numhashstore{foo}{1}{1}
\mycount \numhashretrieve{foo}{438549}\relax 

storing and retrieving values to/from an array foo with the access macros defined above.

From the given test suite, I get the following results:

starting test "testnumhasharray"; size 100000; array implementation "numhasharr
ay" (./numhasharray.tex)) )
real    0m5.472s
user    0m5.192s
sys 0m0.048s

starting test "testhashnumhasharray"; size 100000; array implementation "hashnu
mhasharray" (./hashnumhasharray.tex)) )
real    0m3.754s
user    0m3.664s
sys 0m0.032s

starting test "testcodenumhasharray"; size 100000; array implementation "codenu
mhasharray" (./codenumhasharray.tex)) )
real    0m5.114s
user    0m4.868s
sys 0m0.048s

Here we can see that the effort of encoding the array keys narrows the gap between using \number and the hash-optimized approach of Bruno Le Floch. My own encoding is disqualified because the effort of generating the array keys outweighs the added hash efficiency 🙁

But the good news is that even taking the time needed for encoding into account, the approach of Bruno Le Floch beats the array index based on \number.

Are there other approaches?

I'm interested to hear whether there are other approaches to implementing an array apart from the hash-based one studied here which have comparable performance.

The l3prop package offers an implementation of "property lists" which could be accessed like arrays. But with a test similar to the above, I got performance which is nowhere near that of the hash-based implementation (100 times slower for 1000 array entries and 10000 retrievals; it might even get worse for larger array sizes). So this doesn't seem to be an alternative for storing huge amounts of data.
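
For reference, a store and retrieve with l3prop look roughly like this (the function names are the standard l3prop ones; the snippet is only a sketch, not the exact test code):

% assumes expl3 is loaded (\usepackage{expl3} in LaTeX, or \input expl3-generic in plain TeX)
\ExplSyntaxOn
\prop_new:N \l_my_array_prop
\prop_put:Nnn \l_my_array_prop { 438549 } { some~value }              % store under a numeric key
\tl_set:Nx \l_tmpa_tl { \prop_item:Nn \l_my_array_prop { 438549 } }   % retrieve
\ExplSyntaxOff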

I tried to devise a simple macro-based solution where all entries are stored in a single macro expansion, but unfortunately I couldn't think of any way of retrieving the value of an array entry with any kind of efficiency (I thought delimited arguments could be used for that, but couldn't figure out how).

But let me speculate a little: Let's talk about n array entries with a total content length of m. Conceivably, m>>n.

Then all solutions for storing the n entries I can imagine fall into two categories:

  1. On the order of n new cs names are created. This is the solution sketched above, but it also covers, for instance, all kinds of dictionary implementations where nodes of a tree are stored as single macros.
  2. At least one macro whose expansion text is on the order of m is created.

Clearly, no solution based on the first approach will be more efficient than the simple implementation given above, because that one needs exactly one access to the internal hash table, and you can't go lower than that.

But I think that no solution based on the second approach can be more efficient either, because however you implement it, to retrieve an array element you'll have to expand a macro whose expansion text is on the order of m, and that is clearly slower than a single access to the internal hash table, so there.
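
To make the second category concrete, here is one conceivable shape for such a single-macro lookup (a sketch only, not benchmarked; it assumes the values never contain the | delimiter). The point to note is that every retrieval re-scans the whole expansion text:

% all entries live in one macro: |<key>=<value>|<key>=<value>|...
\def\arraydata{|1=alpha|2=beta|3=gamma|}

% retrieval (re)defines a scanner delimited by the requested key and feeds it
% the complete data, so the work per access is on the order of m
\def\arrayget#1{%
  \def\scanfor##1|#1=##2|##3\endscan{\def\arrayresult{##2}}%
  \expandafter\scanfor\arraydata|#1=\empty|\endscan
}

\arrayget{2}
\message{key 2 holds: \arrayresult}   % prints: beta  (an absent key yields \empty)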

But maybe I'm just lacking imagination, so feel free to present a third solution, or in fact any array implementation more efficient than the simple hash-based one.

Final questions

Based on what I know so far, my basic question can be refined as follows:

  1. Looking at the actual implementation of TeX's hashing algorithm, what is the best way to construct cs names for maximum array performance?
  2. Is there any fundamentally different method of implementing an array in TeX which gives comparable (or even better) performance than the hash-based one?
  3. Am I somehow looking in the wrong direction? Maybe my experimental setup is fundamentally flawed and I'm getting insignificant results? Should I test differently?
  4. I used the original TeX engine for my tests, and I am not using any construct specific to a TeX extension. I tested the very same code with other engines, which did not give any unexpected or significantly different results. It is rather clear that LuaTeX makes it possible to implement arrays in Lua, hence giving completely different options. Do the other engines (pdftex, XeTeX) offer anything which might aid in implementing arrays?

Best Answer

So, first let me say that this is probably the most thoroughly researched question I've read so far. Congratulations.

I'll be reusing some of David's comments. The hashing algorithm takes the csname as an array of bytes (probably some differences there for XeTeX and LuaTeX, let me focus on 8-bit engines) and computes the sum of csname[i]*2^(len-i), modulo the hash prime, where csname[i] denotes the i-th byte in the csname and len is the total number of characters. The control sequence is then stored or retrieved at the resulting position in the hash table. If there were never any hash collisions, then storage and retrieval would be easy and fast, since computing the hash is very fast (additions only). When there is a collision, TeX must check whether the command that it found in the hash table is indeed the csname that was searched for: this is done by comparing strings one character at a time, and is somewhat slower than the hashing algorithm (I think). TeX may need to compare all csnames with the same hash (it can stop once it finds the csname it was looking for).
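
If you want to experiment with this, here is a rough executable model of that hash in plain TeX (my sketch, not TeX's actual code, which lives in tex.web; the prime below is tex.web's default and real binaries may use a larger one):

\newcount\hashacc \newcount\hashtmp
\newcount\hashprime \hashprime=1777 % tex.web's default hash_prime

% \hashmodel{<name>} leaves the modelled hash of <name> in \hashacc
\def\hashmodel#1{\hashacc=0 \hashstep#1\relax}
\def\hashstep#1{%
  \ifx\relax#1\else
    \multiply\hashacc by 2
    \advance\hashacc by `#1 %
    \hashtmp=\hashacc
    \divide\hashtmp by \hashprime
    \multiply\hashtmp by \hashprime
    \advance\hashacc by -\hashtmp % reduce modulo \hashprime
    \expandafter\hashstep\fi}

\hashmodel{pre98765}
\message{modelled hash of pre98765: \number\hashacc}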

When comparing strings which happen to be different but have the same hash, TeX goes through them one character at a time, and can stop comparing as soon as it finds a difference: there lies the difference between a long prefix and a long suffix. In scenarios where there are many hash collisions (e.g., with \number), most of the time is spent comparing strings which happen to have the same hash. If such strings differ on an early character (i.e., short prefix), then they will be recognized as different pretty fast. If they have long common prefixes, then TeX will have to scan them for a long time (until the first difference) before it can declare them distinct.

Why is \number bad for hash collisions? Say that your array keys are 5-digit numbers. Compared to the hash of 00000, the hash of abcde (where a, etc., stand for digits) is larger by 16*a+8*b+4*c+2*d+e. For instance, 98765 has the same hash code as 00000 plus 16*9+8*8+4*7+2*6+5=253. This is tiny. In fact, the hash codes for all sets of five digits lie among only 280 possible values! A typical hash value corresponds to more than 300 csnames, and on average TeX will have to test half of them before finding the one it wants.

How do we spread out hashes? The main culprit in having only 280 different values with \number is that Knuth chose a multiplier equal to 2 only. This is justified when names are more or less random sequences of letters of varying length, as in typical TeX documents (adding one character multiplies the whole hash value by 2, so collisions between strings of different lengths are not particularly likely regardless of their internal structure). To choose my encoding, I decided to replace 16*a+8*b+4*c+2*d+e above by 16^4 * a + 16^3 * b + 16^2 * c + 16 * d + e. This is simply done by inserting three dummy characters between each decimal digit, turning 98765 into 9---8---7---6---5. Why insert three rather than more? Because 16 is the smallest power of 2 greater than 10. Choosing 8 instead of 16 would still lead to hash collisions: the hashes of 0--0--0--0--0 and 9--9--9--9--9 are only 9*8^4+9*8^3+...+9=42129 apart, so a typical hash would correspond to two or three csnames.

The last step was simply to convert this to code:

\def\step#1{#1---\step}
\def\endstep#1\step{}
\message{\step 123 \endstep}

Here I use a typical technique to optimize loops: define the looping macro to do what you want and call itself again (here we want #1 -> #1---), then figure out what to feed as #1 to break the loop. This avoids any test, hence is as fast as can be, while supporting input of arbitrary length. Note also that I am putting dashes after the digit (#1--- instead of ---#1) in an effort to put the relevant stuff early in the csname.

For the specific application of inserting stuff within numbers, we might be able to do better (haven't benchmarked): numbers will not have more than 10 digits (actually 6 might be a reasonable bound, so remove #6 through #9). The following should work and may be faster

\def\dashes#1#2#3#4#5#6#7#8#9{#1---#2---#3---#4---#5---#6---#7---#8---#9---}
\message{\dashes 123********}

(at least 8 stars after the number, to support numbers of arbitrary lengths between 1 and 10).

Yet another option is to use \romannumeral on each digit, which reduces hash collisions because lengths vary. I don't know how this fares.

One thing is certain: csnames of the form <prefix><varying><postfix> of a constant length can have 256*2^(len-1) different hashes (where len is the length of the varying part), and more realistically 10*2^(len-1) hashes only. Getting 10^5 distinct hashes absolutely requires 10 varying bytes, or some variation on the length of the prefix, varying or postfix parts. In practical settings (where key encoding is taken into account, hence where it is difficult to get more than 10 different leading bytes), one needs about 15 bytes of varying material. My \step approach uses 20 (or 17, depending on whether you count the trailing ---). My \dashes approach may use various numbers depending on whether it is tailored to a specific number length.


Here are ways to implement arrays, depending on your use case. First, let us focus on arrays which can contain arbitrary tokens.

  • \toks registers. If you are defining and using your array in a controlled environment, use \toks registers within a group. This is limited to 32768 registers (65536 for LuaTeX, I think). Access is faster than csnames, and there is no need for tricks. In fact, this technique can also be used if your array is defined somewhere else in another form, but you need to access it a lot at a given place (within controlled code). This (is|has been) used in l3sort for merge sort. (A minimal sketch follows after this list.)

  • Multiple \csname approach. This is what you do, and it has been discussed at length here. There is no limit on the number of slots, and keys may be other than numeric, but for numeric keys some trickery is needed to avoid hash collisions. Also, you cannot have many such arrays in memory at the same time lest you fill your hash table completely.

  • A single \csname containing everything. This is essentially what l3prop does at the time of writing. Mapping through all keys is straightforward and fast (per element). Accessing individual elements in an arbitrary order will be very slow, since the whole array needs to be expanded from a macro every time we want to access one element.

  • Hybrid approaches may be better if your access to given items happens in clusters (e.g., you want to look at keys 1–2000, then 1000–3000, then 2000–4000): you could allocate a certain range of \toks registers for this particular purpose and update once in a while which items are stored in this range. You could also store most stuff in a single csname and only send some items to individual csnames, or toks registers.

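A minimal sketch of the \toks variant from the first item above (register numbers picked ad hoc for the illustration):

\begingroup
  \toks1={alpha}  \toks2={beta}  \toks3={gamma}  % store; the assignments are local to the group
  \message{entry 2 holds: \the\toks2}            % retrieve
\endgroup
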
Another type of array has numeric values, be they integers, dimensions, or the like.

  • Various registers can be used in the same way as the \toks registers above. A typical example of this is l3regex, which at the time of writing abuses toks, dim, skip, and muskip registers extensively to store all sorts of information. Do not use integer registers, as constants like \m@ne would break, but expect other things to break too (\maxdimen, ...). Using dimension or skip or muskip registers can get messy when there is a need to convert from one to the other: of course, we use stretch and shrink components too, which leads to \number\glueshrink\mutoglue\muskip3\relax-type constructions.

  • A more practical set of numbers is given by the \fontdimen parameters of a font. In effect, those behave like a usual C array whose size must be declared before starting to use it, and whose items are TeX dimens, i.e., integers less than 2^30 in magnitude. All assignments are global. Retrieval speed is comparable to other registers, perhaps a bit slower. See a question I asked on this topic for some other comments. (A minimal sketch follows below.)

Of course, the toks and csname approaches described earlier work fine for numeric data too, but they seem inefficient compared to \fontdimen, especially concerning memory usage (each \fontdimen parameter uses just a few bytes of memory, versus many bytes for the csname and many more for the tokens representing the value).
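
A minimal sketch of the \fontdimen variant (my sketch; the odd at size is only there to force a fresh font instance, since TeX lets you extend the number of \fontdimen parameters only for the most recently loaded font):

\font\arrayfont=cmr10 at 9.99pt   % fresh font instance used purely as storage
\fontdimen1000\arrayfont=0pt      % declare 1000 slots, right after loading the font
\fontdimen42\arrayfont=123sp      % store the integer 123 in slot 42 (assignment is global!)
\message{slot 42 holds: \number\fontdimen42\arrayfont}   % prints 123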

On a similar topic of optimizing data structures:

  • A long string of characters is stored more efficiently as a csname than as a list of tokens. This has been used at some point in l3fp, storing 10000 decimals of 1/(2pi) as \159154943091895..., to be passed to \string when needed. (A rough sketch of the idea follows below.)

  • Trees which only contain numeric data and only need to be accessed non-expandably can be stored and manipulated as nested boxes, with data stored as various kern nodes for instance. I played with this to implement sorting algorithms some time ago, but the code was messier than the current l3sort and not significantly faster. I'll look at it again some day. It might be possible to reuse this for the relationship between document elements.
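
A rough reconstruction of the first idea (digits abbreviated; this is my sketch, not the actual l3fp code):

% store the digit string as the *name* of a control sequence (one token to keep around)
\expandafter\def\expandafter\invtwopi\expandafter{\csname 159154943091895\endcsname}
% recover the characters with \string when needed (drop the escape character)
{\escapechar=-1 \message{digits: \expandafter\string\invtwopi}}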
