[TeX/LaTeX] How are hyphenation patterns written?

hyphenation

In my MiKTeX folder I checked the hyph-en-us.tex file and found letter patterns written like this:

\patterns{    
.ach4    
.ad4der    
.af1t    
...
}

What do these dots and numbers mean?

Best Answer

The full explanation can be found in Appendix H of the TeXbook. When TeX considers a word for hyphenation, it splits it into “subwords”. The example in the TeXbook uses the word “hyphenation”. A marker, written as a dot (.), is added at each end, and then every subword is considered:

. h y p h e n a t i o n .
.h hy yp ph he en na at ti io on n.
.hy hyp yph phe hen ena nat ati tio ion on.
.hyp [...] ion.

and so on, for every possible length. Each subword is compared with the hyphenation patterns for the current language. A pattern is a sequence of the form

<digit><letter><digit>...<letter><digit>

where a <digit> is omitted if it is 0; so the first pattern in the list for English is equivalent to

0.0a0c0h4
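To make the "omitted zeros" convention concrete, here is a small Python sketch that splits a pattern into its letters and its digits and then writes it back with every implicit 0 spelled out. The function names parse_pattern and expand are made up for this illustration; this is not how TeX stores patterns internally.

```python
def parse_pattern(pattern):
    # Split a pattern such as ".ach4" into its letters and its digits.
    # digits has one more entry than letters: digits[i] is the value
    # written before letters[i], the last entry is the value after the
    # final letter. A digit that is not written counts as 0.
    letters = []
    digits = [0]
    for ch in pattern:
        if ch.isdigit():
            digits[-1] = int(ch)
        else:
            letters.append(ch)
            digits.append(0)
    return "".join(letters), digits


def expand(pattern):
    # Write the pattern with all implicit zeros made explicit.
    letters, digits = parse_pattern(pattern)
    out = [str(digits[0])]
    for letter, digit in zip(letters, digits[1:]):
        out.append(letter + str(digit))
    return "".join(out)


print(expand(".ach4"))    # 0.0a0c0h4
print(expand(".ad4der"))  # 0.0a0d4d0e0r0
print(expand(".af1t"))    # 0.0a0f1t0
```

The dot is treated like any other letter here, which is why it also gets a digit on each side.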

The comparison, of course, ignores the digits: only the letters have to match. Subwords that match no pattern are discarded; in the example only

0h0y3p0h0 0h0e2n0 0h0e0n0a4 0h0e0n5a0t0 1n0a0 0n2a0t0 1t0i0o0 2i0o0 0o2n0

survive. For each position between two letters of the original word, TeX keeps the maximum value that any of the surviving patterns puts there (0 if none does), so the next step results in

.0h0y3p0h0e2n5a4t2i0o2n0.

The feasible hyphenation points are those where an odd digit appears, in this case

hy-phen-ation
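The whole procedure fits in a few lines of Python. This is only a sketch under some assumptions: it loads just the nine patterns that survive in the example, not the full English set, it ignores case and \lccode handling, and the names PATTERNS, hyphenation_values, render and hyphenate are invented for the illustration.

```python
# Only the nine patterns from the example, not the real English list.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na",
            "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    # "hen5at" -> ("henat", [0, 0, 0, 5, 0, 0]); unwritten digits are 0.
    letters, digits = [], [0]
    for ch in pattern:
        if ch.isdigit():
            digits[-1] = int(ch)
        else:
            letters.append(ch)
            digits.append(0)
    return "".join(letters), digits


def hyphenation_values(word, patterns):
    table = dict(parse_pattern(p) for p in patterns)
    marked = "." + word + "."
    # values[i] is the value sitting just before marked[i].
    values = [0] * (len(marked) + 1)
    # Try every subword of every length.
    for start in range(len(marked)):
        for end in range(start + 1, len(marked) + 1):
            digits = table.get(marked[start:end])
            if digits is None:
                continue  # subword matches no pattern: discarded
            for offset, digit in enumerate(digits):
                slot = start + offset
                values[slot] = max(values[slot], digit)
    return values


def render(word, values):
    # Same notation as above: the word between dots with all values shown.
    marked = "." + word + "."
    out = [marked[0]]
    for i in range(1, len(marked)):
        out.append(str(values[i]) + marked[i])
    return "".join(out)


def hyphenate(word, patterns):
    values = hyphenation_values(word, patterns)
    pieces, last = [], 0
    for j in range(1, len(word)):
        if values[j + 1] % 2 == 1:  # odd value before word[j]: break allowed
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return "-".join(pieces)


values = hyphenation_values("hyphenation", PATTERNS)
print(render("hyphenation", values))      # .0h0y3p0h0e2n5a4t2i0o2n0.
print(hyphenate("hyphenation", PATTERNS))  # hy-phen-ation
```

The sketch does not apply the minimum fragment lengths, so it shows every odd position between two letters.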

So the first pattern basically prohibits hyphenating words that start with “ach” after the h; the second one avoids a break between the two d's in words that start with “adder”, because of the large even value. By contrast, the next pattern says that a word starting with “aft” may be broken after the f, unless some other matching pattern has an even value between f and t. The pattern .anti5s says that words starting with “antis” can be broken after i (no value above 5 appears in hyphen.tex). An even value can be used in a longer pattern to countermand an odd value in a shorter pattern, and so on.
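The countermanding can be seen in the “hyphenation” example itself. The following Python sketch lists, for each position, which patterns put a nonzero value there and which value wins. It uses only the nine patterns quoted earlier, and the names PATTERNS and contributions are invented for the illustration.

```python
# Only the nine patterns from the "hyphenation" example.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na",
            "n2at", "1tio", "2io", "o2n"]


def parse_pattern(pattern):
    # "he2n" -> ("hen", [0, 0, 2, 0]); unwritten digits are 0.
    letters, digits = [], [0]
    for ch in pattern:
        if ch.isdigit():
            digits[-1] = int(ch)
        else:
            letters.append(ch)
            digits.append(0)
    return "".join(letters), digits


def contributions(word, patterns):
    # Map each slot (the position just before marked[slot]) to the list
    # of (pattern, digit) pairs that put a nonzero digit there.
    marked = "." + word + "."
    found = {}
    for pattern in patterns:
        letters, digits = parse_pattern(pattern)
        start = marked.find(letters)
        while start != -1:
            for offset, digit in enumerate(digits):
                if digit:
                    found.setdefault(start + offset, []).append((pattern, digit))
            start = marked.find(letters, start + 1)
    return found


word = "hyphenation"
marked = "." + word + "."
for slot, hits in sorted(contributions(word, PATTERNS).items()):
    best = max(digit for _, digit in hits)
    verdict = "break allowed" if best % 2 else "break forbidden"
    names = ", ".join(pattern for pattern, _ in hits)
    print(f"{marked[slot - 1]}|{marked[slot]}  {names}  -> {best}, {verdict}")
```

Between e and n, the short pattern 1na would allow a break, but the longer he2n puts a 2 there and forbids it; between a and t, hena4 overrides 1tio in the same way.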

Break points that lie fewer than \lefthyphenmin characters from the start of the word, or fewer than \righthyphenmin characters from its end, are discarded. Since for English we have \lefthyphenmin=2 and \righthyphenmin=3, both points found above are kept. TeX then adds discretionary items at these points and considers them when breaking the paragraph into lines.
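As a rough Python sketch of this last filter: a break at position j splits the word into its first j characters and the rest, and it is kept only if both fragments are long enough. The function name allowed_breaks and its parameter names are invented here; the defaults mirror the English values.

```python
def allowed_breaks(word, breaks, left_min=2, right_min=3):
    # left_min plays the role of \lefthyphenmin: the fragment before the
    # break needs at least that many characters.
    # right_min plays the role of \righthyphenmin: the fragment after the
    # break needs at least that many characters.
    return [j for j in breaks
            if j >= left_min and len(word) - j >= right_min]


# The two break points found for "hyphenation": hy|phen|ation.
print(allowed_breaks("hyphenation", [2, 6]))            # [2, 6]

# Made-up candidate positions, only to show what gets dropped:
# 1 leaves "h" alone on the left, 9 and 10 leave too little on the right.
print(allowed_breaks("hyphenation", [1, 2, 6, 9, 10]))  # [2, 6]
```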

What's a “word”? Basically (but this is not the full truth) it is a sequence of letters in the same font that follows a space. Consult the TeXbook or TeX by Topic for more information.

How are the patterns prepared? It depends. For English, the program patgen was used: it reads a list of hyphenated words and spits out a list of patterns. For Italian, grammatical rules were used, with low even or odd values (so we find s2c, s2p and so on, or b1b, c1c and so on); other patterns were added by hand to avoid bad breaks, for instance .di2s3cine, which gives the correct hyphenation for “discinesia”.
