[TeX/LaTeX] Ligature suppression with proper hyphenation and comprehensive word list

discretionary, hyphenation, ligatures, typography

[Update: It's been pointed out that my question(s) here weren't clear (or were too hidden). So I inserted red stars at the points where the questions appear.]

In The TeXbook, D.E.K. gives "shelfful" as an example of a word in which automatic ligature creation should be suppressed. In the answer to Exercise 5.1, he suggests writing it as {shelf}ful, shelf{}ful, shelf\/ful, or shelf{\kern0pt}ful.

I'm not actually satisfied with any of those solutions, and I'll explain why not below. ★ I'm looking for a more robust alternative.


Manual methods of ligature suppression?

Four solutions are given in The TeXbook:

  • {shelf}ful and shelf{}ful — These two produce identical results, and — as Knuth points out — TeX will reinsert the ff ligature by itself after hyphenating the word, since it contains no explicit kerns. Words written this way are sometimes hyphenated, but not often enough to keep paragraphs properly justified without resorting to \emergencykern, and never at the most logical place. Worse yet, a word like "cufflink" is incorrectly hyphenated as "cuf-flink"!

  • shelf\/ful — This also has two severe problems. First, words written this way can almost never be hyphenated: a word like "childproofing" is nicely hyphenated as "child-proofing", but a word like "elflike" has nowhere to be split, because the italic correction does not allow it to be hyphenated as "elf-like". Second, there is far too much separation between the syllables, a point which Knuth acknowledges.

  • shelf{\kern0pt}ful — This produces just as poor results as the italic correction method. Very few words can be hyphenated this way, and again only at points other than the most logical place.
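
To compare these forms directly, a small test file along the following lines may help. It is only a sketch: it assumes pdfLaTeX with the default Computer Modern fonts, and the row labels are mine, not Knuth's. The narrow boxes at the end are the usual trick for making TeX reveal where it is willing to hyphenate.

```latex
\documentclass{article}
\begin{document}

% The plain word, then the four forms from the answer to Exercise 5.1.
% With pdfLaTeX and the default Computer Modern fonts, the plain word
% gets an ff ligature; the other rows show how each workaround spaces
% the two f's.
\begin{tabular}{ll}
  unmodified        & shelfful           \\
  group             & {shelf}ful         \\
  empty group       & shelf{}ful         \\
  italic correction & shelf\/ful         \\
  zero kern         & shelf{\kern0pt}ful \\
\end{tabular}

% Writes the break points that TeX's patterns find for the bare words
% to the log file (nothing is added to the typeset page).
\showhyphens{shelfful cufflink elflike}

% A 1pt-wide box pushes TeX to break at the hyphenation points it
% allows, so the behaviour of each form becomes visible.
% \hspace{0pt} lets TeX consider the first word for hyphenation.
\parbox[t]{1pt}{\hspace{0pt}shelf{}ful}\hspace{6em}
\parbox[t]{1pt}{\hspace{0pt}shelf\/ful}\hspace{6em}
\parbox[t]{1pt}{\hspace{0pt}shelf{\kern0pt}ful}

\end{document}
```

Expect overfull-box warnings from the narrow boxes; they are harmless here.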

The shortcomings of the above have led me to look for alternatives. Below are two different solutions that I feel are acceptable, yet ★ I wonder if these are considered best practice or if there is something still better:

  • shelf\-ful — This actually seems to work quite well. The discretionary hyphen allows all such words to be hyphenated and — most importantly — at the most logical location. In fact, an eyesore like "shelfful" almost looks better hyphenated at the end of a line than it does unhyphenated in the middle of a line.

  • shelf\discretionary{-}{}{\kern.033333em}ful — This is a slight modification of the previous form that inserts a ¹⁄₃₀ em space between the constituent characters of the suppressed ligature when the word appears unhyphenated. This is currently my favorite solution.
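
Typing the full \discretionary every time is tedious, so the second form could be wrapped in a macro. The sketch below does that; \ligbreak is a hypothetical name chosen for illustration, not a standard command.

```latex
\documentclass{article}

% \ligbreak is a hypothetical name, not a standard command.
% pre-break:  a hyphen if the line is broken here
% post-break: nothing
% no-break:   a 1/30 em kern between the two letters
\newcommand*{\ligbreak}{\discretionary{-}{}{\kern.033333em}}

\begin{document}
A shelf\-ful of cuff\-links.                  % plain discretionary hyphen

A shelf\ligbreak ful of cuff\ligbreak links.  % with the 1/30 em kern
\end{document}
```

The space after \ligbreak in the source only ends the control word; TeX discards it, so no space appears in the output. Note that a word containing an explicit discretionary is normally broken only at that point, which is the desired behaviour for the compounds discussed here.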

So those are the ways I know of. But actually, I prefer not to think about it at all, and I'm frustrated that TeX doesn't have a built-in list of exceptions for English. Like hyphenation, this is something that I feel the computer should do automatically. (It could certainly be an external file updated periodically as part of popular TeX distributions.)

I found the TeX.SX discussion "Can one (more or less automatically) suppress ligatures for certain words?" and was very happy to see that people are working on this very problem.
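
One existing tool along these lines is the selnolig package, which applies word-pattern rules to break up ligatures automatically. The sketch below assumes LuaLaTeX and an installed copy of selnolig. The \nolig rule is only illustrative and may duplicate a rule the package already ships. How completely its built-in list covers the words discussed here is not something this example checks.

```latex
% Compile with LuaLaTeX; selnolig requires the LuaTeX engine.
\documentclass{article}
\usepackage{fontspec}
\usepackage[english]{selnolig} % loads the package's English rule set

% An additional user rule: the vertical bar marks where the
% ligature should be broken. This particular rule is illustrative
% and may duplicate one the package already provides.
\nolig{shelfful}{shelf|ful}

\begin{document}
shelfful, cufflinks, dwarflike, offline
\end{document}
```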


Comprehensive list of ligature exceptions for English?

I thought I would be able to find a comprehensive list somewhere of exceptions like "shelfful." I probably spent a good half hour looking. I found a few places listing 10 or 20 words, but nothing with the hundreds of words I might expect to see. ★ Does anyone know of such a list?

So last night I took a list of 500,000 English words and wrote a program to run an analysis and produce a list of words suspected to be problematic, based on relative frequencies of prefixes, suffixes, and subwords. This worked well, but produced a number of false positives such as "office" (which it suspected was a compound word formed from the words "off" and "ice") and "beeflower" (which it suspected was a compound word formed from the words "beef" and "lower"; it's actually a compound word based on the words "bee" and "flower").

Consequently, I pared down the resulting list of ~1,500 candidates manually to ~500 words, for example:

chaffinches     chaf-finches
cufflinks       cuff-links
dwarflike       dwarf-like
halflife        half-life
offline         off-line
selfish         self-ish
shelfful        shelf-ful
woofing         woof-ing

Depending on how one feels about suffixes like -ing, -ish, -ier, -iest, -iness, -ily, -ly, and so forth, the list could be pruned down further, probably to 200 words. Consideration of the suffix -ish is important to keep words like "selfish" and "wolfish" from looking like "sel-fish" and "wol-fish" (the latter especially, since "wolffish" is also a word).

Anyway, I would like to take the list I made and submit it to someone maintaining a comprehensive list (★ who?), and in return I'd like to get a copy of what they have. My list feels pretty good, but I'm sure it isn't complete. For instance, I just noticed that "shelflife" wasn't in it because it didn't appear in the dictionary file I used.


Other amusing words

I wrote the program so that it could identify any suspicious letter sequences, not just ff, fi, ffi, fl, and ffl, and I found it interesting to look at a few other letter combinations, just for fun.

This is just a small sample of the thousands of other words identified:

ft

halftime        half-time
offtrack        off-track
rooftop         roof-top

fh

halfhearted     half-hearted
offhand         off-hand
serfhood        serf-hood

st

crosstalk       cross-talk
dogstail        dogs-tail
duststorm       dust-storm
poststrike      post-strike

ct

arctangent      arc-tangent
arctic          arc-tic

kn

hawknosed       hawk-nosed
weeknight       week-night

ph

loophole        loop-hole
scrapheap       scrap-heap
stamphead       stamp-head

th

boathooks       boat-hooks
footholds       foot-holds
goatherd        goat-herd
porthole        port-hole
warthog         wart-hog

sh

gashouse        gas-house
horseshit       horse-shit
mishap          mis-hap
newshawk        news-hawk

tr

hatrack         hat-rack
outrace         out-race
postrace        post-race
postriot        post-riot

wh

knowhow         know-how
sparrowhawk     sparrow-hawk

au

ultraugly       ultra-ugly

ea

readmit         re-admit

ie

antieconomic    anti-economic
antielite       anti-elite
dielectric      di-electric

oi

coincide        co-incide
coinmate        co-inmate
coinsure        co-insure

dd

granddad        grand-dad
guarddog        guard-dog
headdress       head-dress

ee

paleethnology   pale-ethnology
preenable       pre-enable
preescape       pre-escape
reedit          re-edit
reelect         re-elect
reemit          re-emit
reenable        re-enable
reencounter     re-encounter
reentry         re-entry

gg

doggone         dog-gone

hh

archhead        arch-head
bathhouse       bath-house
fishhold        fish-hold
methhead        meth-head
withhold        with-hold

ll

schoollike      school-like
soulless        soul-less
taillight       tail-light
wheelless       wheel-less

mm

bottommost      bottom-most
filmmaker       film-maker
teamman         team-man

nn

humanness       human-ness
nonnational     non-national
penname         pen-name
swannecked      swan-necked

oo

cooccur         co-occur
cooperate       co-operate
coopt           co-opt
proode          pro-ode
pseudoorganic   pseudo-organic

pp

dampproof       damp-proof
lamppost        lamp-post
slipproof       slip-proof

rr

interradial     inter-radial
overran         over-ran
overrich        over-rich
underripe       under-ripe

ss

newsstand       news-stand
hisself         his-self

tt

cattail         cat-tail
coattail        coat-tail
nighttime       night-time
ofttimes        oft-times
outtakes        out-takes
outthink        out-think
shirttail       shirt-tail
shitton         shit-ton

ww

glowworm        glow-worm
sawworker       saw-worker
showworthy      show-worthy
yellowwood      yellow-wood

I wouldn't go so far as to insert micro-spacing in these other cases, but I find these interesting because of the way the mind can be tricked momentarily while parsing the letters and phonemes. Words like "cooccur" and "coinmate" and "reemit" really make my brain hurt!

Best Answer

Something at the back of my mind told me Barbara had a similar list. Google helped out with some links; here is one:

http://comments.gmane.org/gmane.comp.tex.live/14987