[Update: It's been pointed out that my question(s) here weren't clear (or were too hidden). So I inserted red stars at the points where the questions appear.]
In The TeXbook, D.E.K. gives "shelfful" as an example of a word in which automatic ligature creation should be suppressed. In the answer to Exercise 5.1, he suggests writing it as {shelf}ful, shelf{}ful, shelf\/ful, or shelf{\kern0pt}ful.
I'm not actually satisfied with any of those solutions, and I'll explain why not below. I'm looking for a more robust alternative.
Manual methods of ligature suppression?
Four solutions are given in The TeXbook:
- {shelf}ful and shelf{}ful: These two produce identical results, and, as Knuth points out, TeX will reinsert the ff ligature by itself after hyphenating the word, since it contains no explicit kerns. Words written this way are sometimes hyphenated, but not often enough to keep paragraphs properly justified without resorting to \emergencystretch, and never at the most logical place. Worse yet, a word like "cufflink" is incorrectly hyphenated as "cuf-flink"!
- shelf\/ful: This also has two severe problems. First, words written this way can almost never be hyphenated: a word like "childproofing" is nicely hyphenated as "child-proofing", but a word like "elflike" has nowhere to be split because the italic correction does not allow it to be hyphenated as "elf-like". Second, there is far too much separation between the syllables, a point which Knuth acknowledges.
- shelf{\kern0pt}ful: This produces just as poor results as the italic correction method. Very few words can be hyphenated this way, and again only at points other than the most logical place.
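To compare the four TeXbook forms side by side, a small test file along these lines may help. It is only a sketch: it assumes a classic TeX engine such as pdfLaTeX with the default Computer Modern fonts, and other engines or fonts may treat the braces and ligatures differently.

```latex
\documentclass{article}
\begin{document}
% Each paragraph typesets the same word with one of the forms discussed above.
\noindent shelfful           \par % no suppression, for reference
\noindent {shelf}ful         \par % grouping
\noindent shelf{}ful         \par % empty group
\noindent shelf\/ful         \par % italic correction
\noindent shelf{\kern0pt}ful \par % explicit zero-width kern
\end{document}
```

Setting the same words inside a narrow column would additionally show where, if anywhere, each form allows a line break.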
The shortcomings of the above have led me to look for alternatives. Below are two different solutions that I feel are acceptable, yet I wonder if these are considered best practice or if there is something still better:
- shelf\-ful: This actually seems to work quite well. The discretionary hyphen allows all such words to be hyphenated and, most importantly, at the most logical location. In fact, an eyesore like "shelfful" almost looks better hyphenated at the end of a line than it does unhyphenated in the middle of a line.
- shelf\discretionary{-}{}{\kern.033333em}ful: This is a slight modification of the previous form, in order to insert ¹⁄₃₀ em of space between the constituent characters of the suppressed ligature in cases where the word appears unhyphenated. This is currently my favorite solution.
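To avoid typing the full discretionary each time, the second form could be wrapped in a macro. The following is a minimal sketch; the macro name \nolig is made up for illustration and is not part of any package.

```latex
\documentclass{article}
% \nolig is a hypothetical helper name, not a standard command.
% At a line break it leaves a hyphen; otherwise it inserts a 1/30 em kern,
% which keeps the neighbouring letters from forming a ligature.
\newcommand*{\nolig}{\discretionary{-}{}{\kern.033333em}}
\begin{document}
shelf\nolig ful, cuff\nolig links, elf\nolig like
\end{document}
```

Because \nolig is a control word, the space after it is ignored, so shelf\nolig ful is typeset as a single word.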
So those are the ways I know of. But actually, I prefer not to think about it at all, and I'm frustrated that TeX doesn't have a built-in list of exceptions for English. Like hyphenation, this is something that I feel the computer should do automatically. (It could certainly be an external file updated periodically as part of popular TeX distributions.)
I found the TeX.SX discussion "Can one (more or less automatically) suppress ligatures for certain words?" and was very happy to see that people are working on this very problem.
Comprehensive list of ligature exceptions for English?
I thought I would be able to find a comprehensive list somewhere of exceptions like "shelfful." I probably spent a good half hour looking. I found a few places listing 10 or 20 words, but nothing with the hundreds of words I might expect to see. Does anyone know of such a list?
So last night I took a list of 500,000 English words and wrote a program to run an analysis and produce a list of words suspected to be problematic, based on relative frequencies of prefixes, suffixes, and subwords. This worked well, but produced a number of false positives such as "office" (which it suspected was a compound word formed from the words "off" and "ice") and "beeflower" (which it suspected was a compound word formed from the words "beef" and "lower"; it's actually a compound word based on the words "bee" and "flower").
Consequently, I pared down the resulting list of ~1,500 candidates manually to ~500 words, for example:
chaffinches chaf-finches
cufflinks cuff-links
dwarflike dwarf-like
halflife half-life
offline off-line
selfish self-ish
shelfful shelf-ful
woofing woof-ing
Depending on how one feels about suffixes like -ing, -ish, -ier, -iest, -iness, -ily, -ly, and so forth, the list could be pruned down further, probably to 200 words. Consideration of the suffix -ish is important to keep words like "selfish" and "wolfish" from looking like "sel-fish" and "wol-fish" (the latter especially, since "wolffish" is also a word).
Anyway, I would like to take the list I made and submit it to someone maintaining a comprehensive list (who?), and in return I'd like to get a copy of what they have. My list feels pretty good, but I'm sure it isn't complete. For instance, I just noticed that "shelflife" wasn't in it because it didn't appear in the dictionary file I used.
Other amusing words
I wrote the program so that it could identify any suspicious letter sequences, not just ff, fi, fl, ffi, and ffl, and I found it interesting to look at a few other letter combinations, just for fun.
This is just a small sample of the thousands of other words identified:
ft
halftime half-time
offtrack off-track
rooftop roof-top
fh
halfhearted half-hearted
offhand off-hand
serfhood serf-hood
st
crosstalk cross-talk
dogstail dogs-tail
duststorm dust-storm
poststrike post-strike
ct
arctangent arc-tangent
arctic arc-tic
kn
hawknosed hawk-nosed
weeknight week-night
ph
loophole loop-hole
scrapheap scrap-heap
stamphead stamp-head
th
boathooks boat-hooks
footholds foot-holds
goatherd goat-herd
porthole port-hole
warthog wart-hog
sh
gashouse gas-house
horseshit horse-shit
mishap mis-hap
newshawk news-hawk
tr
hatrack hat-rack
outrace out-race
postrace post-race
postriot post-riot
wh
knowhow know-how
sparrowhawk sparrow-hawk
au
ultraugly ultra-ugly
ea
readmit re-admit
ie
antieconomic anti-economic
antielite anti-elite
dielectric di-electric
oi
coincide co-incide
coinmate co-inmate
coinsure co-insure
dd
granddad grand-dad
guarddog guard-dog
headdress head-dress
ee
paleethnology pale-ethnology
preenable pre-enable
preescape pre-escape
reedit re-edit
reelect re-elect
reemit re-emit
reenable re-enable
reencounter re-encounter
reentry re-entry
gg
doggone dog-gone
hh
archhead arch-head
bathhouse bath-house
fishhold fish-hold
methhead meth-head
withhold with-hold
ll
schoollike school-like
soulless soul-less
taillight tail-light
wheelless wheel-less
mm
bottommost bottom-most
filmmaker film-maker
teamman team-man
nn
humanness human-ness
nonnational non-national
penname pen-name
swannecked swan-necked
oo
cooccur co-occur
cooperate co-operate
coopt co-opt
proode pro-ode
pseudoorganic pseudo-organic
pp
dampproof damp-proof
lamppost lamp-post
slipproof slip-proof
rr
interradial inter-radial
overran over-ran
overrich over-rich
underripe under-ripe
ss
newsstand news-stand
hisself his-self
tt
cattail cat-tail
coattail coat-tail
nighttime night-time
ofttimes oft-times
outtakes out-takes
outthink out-think
shirttail shirt-tail
shitton shit-ton
ww
glowworm glow-worm
sawworker saw-worker
showworthy show-worthy
yellowwood yellow-wood
I wouldn't go so far as to insert micro-spacing in these other cases, but I find these interesting because of the way the mind can be tricked momentarily while parsing the letters and phonemes. Words like "cooccur" and "coinmate" and "reemit" really make my brain hurt!
Best Answer
Something at the back of my mind told me Barbara had a similar list; Google helped out with some links here:
http://comments.gmane.org/gmane.comp.tex.live/14987