[TeX/LaTeX] How does TeX’s hyphenation algorithm work?

hyphenation languages

TeX uses a built-in algorithm to decide where words can be hyphenated. This algorithm sometimes fails, as discussed in the questions "Breaking words at the end of line" and "How to manually set where a word is split?". (There's also an online list of known algorithm failures.) How does the algorithm work, in broad strokes? I know it's language-dependent, so for concreteness let's say the American and/or British English algorithms.

Best Answer

The algorithm itself is not language dependent; only the data it uses depends on the language.

There are two basic components. The first is a list of hyphenation exceptions, some of which are specified in the language definition, while others can be added at any time in a document. If you write \hyphenation{one-tw-o-thr-ee}, then that word (and its upper/lowercase variants) will be hyphenated as shown. Note that no other linguistic variants, such as plurals, are affected by this: if you want "onetwothrees" to be hyphenated in a similar way, that would also need to be listed.
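To make the exact-match behaviour concrete, here is a minimal Python sketch of an exception list. The names parse_exception, EXCEPTIONS and lookup are made up for illustration, and this is not how TeX stores exceptions internally; it only mimics the behaviour described above (case variants match, plurals do not).

```python
def parse_exception(entry):
    """Turn 'one-tw-o-thr-ee' into ('onetwothree', [3, 5, 6, 9]).

    Each position counts the letters that come before an allowed break.
    """
    pieces = entry.lower().split("-")
    word = "".join(pieces)
    positions = []
    offset = 0
    for piece in pieces[:-1]:
        offset += len(piece)
        positions.append(offset)
    return word, positions


# Hypothetical exception table, keyed by the lowercased word.
EXCEPTIONS = dict(parse_exception(e) for e in ["one-tw-o-thr-ee"])


def lookup(word):
    """Return the break positions for an exact, case-insensitive match, else None."""
    return EXCEPTIONS.get(word.lower())


print(lookup("OneTwoThree"))   # [3, 5, 6, 9]
print(lookup("onetwothrees"))  # None: the plural is a different word
```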

Hyphenation exceptions are useful for special words and give total control in the document, but clearly just listing every word in the language isn't realistic, so the main mechanism is patterns.

For each language, the format inputs a file that executes \patterns. The original US English patterns live at a location such as

/usr/local/texlive/2017/texmf-dist/tex/generic/hyphen/hyphen.tex

and look like

\patterns{
.ach4
.ad4der
.af1t
.al3t
.am5at
.an5c
  four thousand more of these lines

If you ignore the digits, each of these runs of letters is matched against the words in the paragraph (. meaning start or end of a word). For each word, any pattern that matches a substring assigns its digits (0-9) to the gaps between the letters it covers (no digit being the same as 0). If two or more patterns assign a value to the same inter-letter gap, the highest digit wins.

So after all patterns have been matched against a word, there is a value 0-9 assigned between each pair of adjacent letters. If this value is odd, hyphenation is allowed at that point; if it is even, it is not.
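The matching step can be sketched in a few lines of Python. This is only an illustration of the idea, not TeX's actual implementation (which stores the patterns in a compact trie); the function names and the short PATTERNS list are assumptions hand-picked for this example, not the full set that TeX loads.

```python
import re

# A few patterns hand-picked for this example (not the full US English set).
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    """Split 'hy3ph' into the letters 'hyph' and the digits [0, 0, 3, 0, 0].

    digits[i] is the value in front of letters[i]; the last entry is the
    value after the final letter. A missing digit counts as 0.
    """
    letters = re.sub(r"\d", "", pattern)
    digits = [0] * (len(letters) + 1)
    i = 0
    for ch in pattern:
        if ch.isdigit():
            digits[i] = int(ch)
        else:
            i += 1
    return letters, digits


def inter_letter_values(word, patterns):
    """Return the highest digit assigned to each gap between adjacent letters."""
    dotted = "." + word.lower() + "."   # '.' marks the start and end of the word
    values = [0] * (len(dotted) + 1)    # values[k] sits in front of dotted[k]
    for pattern in patterns:
        letters, digits = parse_pattern(pattern)
        start = dotted.find(letters)
        while start != -1:              # visit every occurrence, overlaps included
            for offset, digit in enumerate(digits):
                values[start + offset] = max(values[start + offset], digit)
            start = dotted.find(letters, start + 1)
    return values[2:-2]                 # keep only the gaps inside the word


def hyphenate(word, patterns):
    """Insert a hyphen wherever the final inter-letter value is odd."""
    pieces, last = [], 0
    for position, value in enumerate(inter_letter_values(word, patterns), start=1):
        if value % 2 == 1:
            pieces.append(word[last:position])
            last = position
    pieces.append(word[last:])
    return "-".join(pieces)


print(inter_letter_values("hyphenation", PATTERNS))  # [0, 3, 0, 0, 2, 5, 4, 2, 0, 2]
print(hyphenate("hyphenation", PATTERNS))            # hy-phen-ation
```

Note how the gap between "a" and "t" ends up with the even value 4 (from hena4 beating 1tio), which is what suppresses a break there, while the 5 from hen5at beats the 2 from n2at and allows the break before "ation".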

There are additional integer parameters (\lefthyphenmin and \righthyphenmin) that specify how close to the start or end of a word a hyphen may be placed.
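As a rough sketch of how such limits act on a list of candidate break positions: the Python names allowed_breaks, left_min and right_min are hypothetical, and the defaults 2 and 3 are just example values, taken here to mean "at least this many letters must remain before / after the hyphen".

```python
def allowed_breaks(word, positions, left_min=2, right_min=3):
    """Keep only breaks that leave enough letters on both sides.

    A position p means "break after the first p letters of word".
    """
    return [
        p for p in positions
        if p >= left_min and len(word) - p >= right_min
    ]


# Candidate breaks at 1, 3, 9 and 10 in an 11-letter word.
print(allowed_breaks("onetwothree", [1, 3, 9, 10]))               # [3]
print(allowed_breaks("onetwothree", [1, 3, 9, 10], right_min=2))  # [3, 9]
```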

TeX also uses some clever optimisations so that it does not have to pattern match every word: it only needs to find the hyphenation points in words that could be a feasible break point in a paragraph. That is an internal optimisation, though, and it doesn't affect the basic hyphenation algorithm.

For some languages that have regular spelling and hyphenation rules, the patterns can be hand-written to reflect those rules. English defeats description by rules, so for cases like this the patterns are usually made by taking an existing dictionary of hyphenated words (e.g. as supplied by a publisher) and using the patgen program to compress the dictionary into a set of patterns that reproduces (say) 80% of the hyphens in the original dictionary.
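To give a feel for what "reproduces 80% of the hyphens" means, here is a small Python sketch that compares predicted hyphenations against a dictionary and counts good, bad and missed hyphens. It does not generate patterns the way patgen does; the function names and the two-word "dictionary" are invented purely for illustration.

```python
def break_positions(hyphenated):
    """Return the set of break positions in a string such as 'al-go-rithm'."""
    positions, offset = set(), 0
    for piece in hyphenated.split("-")[:-1]:
        offset += len(piece)
        positions.add(offset)
    return positions


def score(dictionary, predicted):
    """Count (good, bad, missed) hyphens of the predictions against the dictionary."""
    good = bad = missed = 0
    for truth, guess in zip(dictionary, predicted):
        t, g = break_positions(truth), break_positions(guess)
        good += len(t & g)     # hyphens found and correct
        bad += len(g - t)      # hyphens produced that the dictionary forbids
        missed += len(t - g)   # dictionary hyphens that were not found
    return good, bad, missed


# Made-up data: what a dictionary says versus what some pattern set produced.
dictionary = ["hy-phen-ation", "al-go-rithm"]
predicted = ["hy-phen-ation", "algo-rit-hm"]

good, bad, missed = score(dictionary, predicted)
print(good, bad, missed)         # 3 1 1
print(good / (good + missed))    # 0.75, i.e. 75% of the dictionary hyphens found
```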
