What is a "correct" way to indent TeX documents?
I believe there is no "correct" way -- certainly no single correct way -- to indent input lines of an ordinary TeX or LaTeX document. (The situation may be different for input that adheres to LaTeX3 coding methods, but I'm not sure of that.) As long as TeX can parse the input, it's completely agnostic as to the form of the input it receives. Any indentation practices -- whether indenting with "soft tabs" or "hard" spaces, and whether the indentation amounts to 2, 4, or 8 spaces -- are therefore entirely cosmetic and voluntary in nature. To be sure, I don't believe anyone in their right mind would advocate not using some rule for indenting parts of one's TeX code.
A separate thought: in order to raise the readability of TeX input code, a technique that is probably as important as (or possibly even more important than) adopting some rule for indentation is using vertical whitespace liberally and consistently -- between paragraphs, above a sectioning command, before and after a \newenvironment statement, before and after a LaTeX float, etc.
So, first let me say that this is probably the most thoroughly researched question I've read so far. Congratulations.
I'll be reusing some of David's comments. The hashing algorithm takes the csname as an array of bytes (probably some differences there for XeTeX and LuaTeX; let me focus on 8-bit engines) and computes the sum of csname[i] * 2^(len-i), reduced modulo the hash prime, where csname[i] denotes the i-th byte of the csname and len is the total number of characters. The control sequence is then stored or retrieved at the resulting position in the hash table. If there were never any hash collisions, storage and retrieval would be easy and fast, since computing the hash is very fast (additions only). When there is a collision, TeX must check whether the command it found in the hash table is indeed the csname that was searched for: this is done by comparing strings one character at a time, and is somewhat slower than the hashing algorithm (I think). TeX may need to compare all csnames with the same hash (it can stop once it finds the csname it was looking for).
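As a rough illustration (my own sketch, not TeX's actual code), the hash of a csname like foo can be recomputed with ordinary TeX arithmetic; the prime 8501 is an assumption, since the actual hash prime depends on the engine and its configuration, and tex.web reduces modulo the prime at every step rather than once at the end:
\newcount\h \newcount\tmp
\h=102                             % byte code of `f'
\multiply\h by 2 \advance\h by 111 % h = 2h + byte code of `o'
\multiply\h by 2 \advance\h by 111 % h = 2h + byte code of `o'
\tmp=\h \divide\tmp by 8501 \multiply\tmp by 8501
\advance\h by -\tmp                % h = h mod 8501 (assumed hash prime)
\message{hash of `foo' = \the\h}   % prints 741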
When comparing strings which happen to be different but have the same hash, TeX goes through them one character at a time, and can stop comparing as soon as it finds a difference: there lies the difference between a long prefix and a long suffix. In scenarios where there are many hash collisions (e.g., with \number), most of the time is spent comparing strings which happen to have the same hash. If such strings differ on an early character (i.e., have a short common prefix), they will be recognized as different pretty fast. If they have long common prefixes, TeX will have to scan them for a long time (until the first difference) before it can declare them distinct.
Why is \number bad for hash collisions? Say that your array keys are 5-digit numbers. Compared to the hash of 00000, the hash of abcde (where a, etc., stand for digits) is 16*a + 8*b + 4*c + 2*d + e. For instance, 98765 has the same hash code as 00000 plus 16*9 + 8*8 + 4*7 + 2*6 + 5 = 253. This is tiny. In fact, the hash codes of all five-digit keys lie among only 280 possible values! A typical hash value corresponds to more than 300 csnames, and on average TeX will have to test half of them before finding the one it wants.
How do we spread out hashes? The main culprit in having only 280 different values with \number is that Knuth chose a multiplier of only 2. This is justified when names are more or less random sequences of letters of varying length, as in typical TeX documents (adding one character multiplies the whole hash value by 2, so collisions between strings of different lengths are not particularly likely, regardless of their internal structure). To choose my encoding, I decided to replace 16*a + 8*b + 4*c + 2*d + e above by 16^4*a + 16^3*b + 16^2*c + 16*d + e. This is done simply by inserting three dummy characters between consecutive decimal digits, turning 98765 into 9---8---7---6---5. Why insert three rather than more? Because 16 is the smallest power of 2 greater than 10. Choosing 8 instead of 16 would still lead to hash collisions: the hashes of 0--0--0--0--0 and 9--9--9--9--9 are only 9*8^4 + 9*8^3 + ... + 9 = 42129 apart, so a typical hash would correspond to two or three csnames.
The last step was simply to convert this to code:
\def\step#1{#1---\step}     % emit the digit and three dashes, then loop
\def\endstep#1\step{}       % delimited argument: removes the leftover ---\step
\message{\step 123 \endstep}% produces 1---2---3---
Here I use a typical technique to optimize loops: define the looping macro to do what you want and call itself again (here we want #1 -> #1---), then figure out what to feed as #1 to break the loop. This avoids any test, hence is as fast as can be, while supporting input of arbitrary length. Note also that I put the dashes after the digit (#1--- instead of ---#1) in an effort to put the relevant material early in the csname.
For the specific application of inserting stuff within numbers, we might be able to do better (I haven't benchmarked): numbers will not have more than 10 digits (actually, 6 might be a reasonable bound, in which case remove #6 through #9). The following should work and may be faster:
\def\dashes#1#2#3#4#5#6#7#8#9{#1---#2---#3---#4---#5---#6---#7---#8---#9---}
\message{\dashes 123********}
(Use at least 8 stars after the number, to support numbers of arbitrary length between 1 and 10.)
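For instance, storing and retrieving an item keyed by 98765 could look as follows (the array name arr is illustrative; the unconsumed stars simply become a constant postfix of the csname):
\expandafter\def\csname arr\dashes 98765********\endcsname{some value}
\message{\csname arr\dashes 98765********\endcsname}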
Yet another option is to use \romannumeral on each digit, which reduces hash collisions because the lengths vary. I don't know how this fares.
One thing is certain: csnames of the form <prefix><varying><postfix> of constant length can have 256*2^(len-1) different hashes (where len is the length of the varying part), and more realistically only 10*2^(len-1) hashes. Getting 10^5 distinct hashes absolutely requires 10 varying bytes, or some variation in the length of the prefix, varying, or postfix parts. In practical settings (where key encoding is taken into account, hence where it is difficult to get more than 10 different leading bytes), one needs about 15 bytes of varying material. My \step approach uses 20 (or 17, depending on whether you count the trailing ---). My \dashes approach may use various numbers, depending on whether it is tailored to a specific number length.
Here are ways to implement arrays, depending on your use case. First, let us focus on arrays which can contain arbitrary tokens.
\toks registers. If you are defining and using your array in a controlled environment, use \toks registers within a group. This is limited to 32768 registers (65536 for LuaTeX, I think). Access is faster than with csnames, and there is no need for tricks. In fact, this technique can also be used if your array is defined somewhere else in another form but you need to access it a lot at a given place (within controlled code); it is (or has been) used in l3sort for merge sort. (A short sketch of this and the next approach follows below.)
Multiple \csname approach. This is what you do, and it has been discussed at length here. There is no limit on the number of slots, and keys may be non-numeric, but for numeric keys some trickery is needed to avoid hash collisions. Also, you cannot have many such arrays in memory at the same time, lest you fill your hash table completely.
A single \csname containing everything. This is essentially what l3prop does at the time of writing. Mapping through all keys is straightforward and fast (per element). Accessing individual elements in an arbitrary order will be very slow, since the whole array needs to be expanded from a macro every time we want to access one element.
Hybrid approaches may be better if your accesses to given items happen in clusters (e.g., you want to look at keys 1-2000, then 1000-3000, then 2000-4000): you could allocate a certain range of \toks registers for this particular purpose and update once in a while which items are stored in that range. You could also store most stuff in a single csname and only send some items to individual csnames or toks registers.
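As a minimal sketch of the first two approaches (the register number, array name, and use of the dash encoding are illustrative):
% (1) \toks registers within a group; e-TeX provides 32768 of them:
\begingroup
\toks42={arbitrary balanced text}% store at index 42
\message{\the\toks42}%            retrieve
\endgroup
% (2) one csname per slot, with the dash encoding from above:
\expandafter\def\csname myarray 4---2\endcsname{arbitrary balanced text}
\message{\csname myarray 4---2\endcsname}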
Another type of array holds numeric values: integers, dimensions, and the like.
Various registers can be used in the same way as the \toks registers above. A typical example is l3regex, which at the time of writing abuses toks, dim, skip, and muskip registers extensively to store all sorts of information. Do not use integer registers, as things like \m@ne would break, and expect other things to break too (\maxdimen, ...). Using dimension, skip, or muskip registers can get messy when there is a need to convert from one to the other: of course, we use the stretch and shrink components too, which leads to constructions of the type \number\glueshrink\mutoglue\muskip3\relax.
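For instance, retrieving the shrink component of a muskip register as a plain integer might look like this (register number 3 is arbitrary; \mutoglue and \glueshrink are e-TeX extensions, and I am assuming \mutoglue maps mu to pt one-for-one):
\muskip3=2mu plus 1mu minus 3mu
\message{\number\glueshrink\mutoglue\muskip3\relax}% should print 196608, i.e., 3pt in sp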
A more practical set of numbers is given by the \fontdimen parameters of a font. In effect, those behave like a usual C array whose size must be declared before starting to use it, and whose items are TeX dimensions, i.e., integers less than 2^30 in magnitude. All assignments are global. Retrieval speed is comparable to that of other registers, perhaps a bit slower. See a question I asked on this topic for some other comments.
Of course, the toks and csname approaches described earlier work fine for numeric data too, but they seem inefficient compared to \fontdimen, especially in terms of memory usage (each \fontdimen parameter uses just a few bytes of memory, versus many bytes for a csname and many more for the tokens representing the value).
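A minimal sketch of the \fontdimen approach; the dummy font, its size, and the indices are illustrative, and the size must be declared right after loading, since new \fontdimen parameters can only be created for the most recently loaded font:
\font\arraydata=cmr10 at 1pt  % a dummy instance that just carries the data
\fontdimen100\arraydata=0pt   % declare the size: parameters 1 to 100 now exist
\fontdimen42\arraydata=3.25pt % store (all \fontdimen assignments are global)
\message{item 42 = \the\fontdimen42\arraydata}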
On a similar topic of optimizing data structures:
A long string of characters is stored more efficiently as a csname than as a list of tokens. This has been used at some point in l3fp, storing 10000 decimals of 1/(2pi) as \159154943091895..., to be passed to \string when needed.
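For instance, a minimal sketch (the digits are the beginning of 1/(2pi) quoted above; \escapechar=-1 makes \string yield just the characters of the name):
\begingroup
\escapechar=-1
\expandafter\def\csname 159154943091895\endcsname{}% the data lives in the name itself
\message{\expandafter\string\csname 159154943091895\endcsname}%
\endgroup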
Trees which only contain numeric data and only need to be accessed non-expandably can be stored and manipulated as nested boxes, with data stored as various kern nodes, for instance. I played with this to implement sorting algorithms some time ago, but the code was messier than the current l3sort and not significantly faster. I'll look at it again some day. It might be possible to reuse this for the relationship between document elements.
Best Answer
The alternative to TeX is Microsoft Word. Good luck with that.
Backwards compatibility of Word documents breaks after a few release cycles. LaTeX has been stable for 25+ years.
Word doesn't search for fonts either. That's an operating system issue.
It's not hard to make macros available. At the very least, all you have to do is put the package file in the same directory as the document file. Every distribution also comes with a default user location for macro files.
So...use the ones which do?
On my machine I double-click a TTF font and the OS installs it in my font library. XeLaTeX finds that font with zero problem.
There aren't macros in Word files like in TeX files, but Word doesn't auto-heal either: if a file uses fonts that aren't on the machine, it won't get them automatically.
The h command at a message provides the best possible guess as to the source of an error. But the same objection could be made of many languages. By "odd bugs introduced by escaping and quoting" do you mean in users' input files? That's not the fault of TeX, and changing the input file format is not going to fix PEBKAC errors. The TeX developers have fixed 9 bugs in 10 years.
Maybe this is a bit of a strawman argument since you didn't propose Word as a substitute, but I don't think your objections are unique to *TeX or warrant replacement.