[TeX/LaTeX] Obtain \badness or glue adjustment for each line

Tags: badness, hyphenation, line-breaking, output, spacing

This might be interesting for perfectionists and/or fastidious typesetters who would like to improve the document even further (beyond the magnificence of a book with zero bad boxes).

We all know that the hyphenation algorithm, as conceived by Mr. Franklin Mark Liang and implemented in the patgen programme, is based on processing a whole bunch of pre-hyphenated words, calculating the likelihood of a permitted break, building a compact pattern table for the sake of space and memory efficiency, etc. It may correctly identify up to 90% of possible breaks, depending on the language (a small illustration of the pattern format follows the list below). However, given that present-day computers are no longer constrained by the limitations of ’82, the hyphenation in *TeX output can be further improved in two ways:

  1. We can create a comprehensive hyphenation database for each language and achieve 100% accuracy.
  2. We can deal with the problem when it appears: analyse the log, find all occurrences of overfull boxes and, if they stem from an undiscovered permissible hyphenation point, manually add the word to the “white list” of \hyphenation{...}.
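(For readers unfamiliar with the pattern format mentioned above: patterns are letter sequences interleaved with digits, where odd digits encourage a break and even digits inhibit one. The two toy patterns below are among those The TeXbook lists as relevant to the word “hyphenation”; mind that pdfTeX accepts \patterns only while a format is being built, whereas LuaTeX also accepts it at run time.)

\patterns{%
  hy3ph % odd digit: encourages the break “hy-phen…”
  he2n  % even digit: discourages a break between “he” and “n”
}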

While in English this doesn’t seem to be much of a problem owing to the abundance of short words, in Russian and German it is a frequent occurrence: sometimes I get five overfull boxes solely due to missed hyphenation points, and after some hard-coding similar to \hyphenation{ми-н-да-лём ра-с-по-ря-ди-те-лю мо-ж-но}, all the bad boxes are gone. As a typesetter of Russian texts, I can assure you that they look perfect after TeX with [russian]{babel}, which beautifully handles all the diverse punctuation, but the practice of tying (~) one-letter words (and ideally some two-letter auxiliaries) to the following word is marred by unfound hyphenation points, and overfulls ensue. Since many books are compact in size, the text area is often limited to 100×175 mm or even smaller. Believe me, this is a real challenge for a typesetter of Cyrillic texts.

Problem in one sentence: a missed hyphenation point causes inferior line breaking, or breaking close to æsthetically unacceptable; once a manual hyphenation is introduced, a new breakpoint is used and the breaking improves. However, manual “leak plugging” is no fun.

Any additional non-breakable space (~) is a restriction, and we all know that, mathematically, it cannot decrease the overall “badness” of a paragraph. Any additional condition is a compromise: the constrained minimum obtained by minimising the cubic badness function is at least as large as it would have been had the restriction not been imposed (the same holds in regression analysis: the restricted sum of squared residuals is greater than the unrestricted one). The problem is aggravated by the fact that TeX does not report a line whose badness does not exceed the reporting threshold of 1000 but comes very close to it.
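For reference, the “cubic function” in question is TeX’s badness measure (The TeXbook, Chapter 14). If r denotes the deformation of a line divided by its total available stretch (or shrink), then

\[ \mathrm{badness} \;\approx\; \min\bigl(100\,\lvert r\rvert^{3},\; 10000\bigr), \]

and a box whose glue would have to shrink beyond its limit gets the special value 1000000 instead.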

It would be much nicer if I could hunt down, say, a line with badness 990 that is only that bad because a breakpoint in a word was not found by the hyphenation algorithm! It would be much nicer to see all the spots where breakpoints were missed and add more degrees of freedom, thus improving the look (if a word has to be broken anyway, better give it the maximum number of breaks allowed by the rules of the language!).

This has led me to two possible ways of dealing with the problem:

  1. Make and compile a DIY modification of pdfLaTeX that would report every occurrence of \badness exceeding X (say 700) in every line in which a word had to be hyphenated, which is undoubtedly a dirty hack;
  2. Write an extension that would display the badness after each line (a kind of “über-draft” mode that not only prints a black rectangle where an overflow has occurred, but also reports instances of the interword space being close to its maximum or minimum allowed value; a crude approximation with engine parameters is sketched after this list).
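For the record, a very crude approximation of item 2 needs no compilation at all: the stock engine parameters already let one lower the reporting thresholds so that near-critical lines at least show up in the log (a blunt instrument, not the per-line display I am after):

\hbadness=700 % report Loose/Underfull/Tight boxes once badness exceeds 700
\hfuzz=0pt    % report every overfull box, however tiny the excess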

I thought that it might be possible in LuaTeX to print, in the margin, the absolute amount of glue added to the standard interword space (3.33333pt plus 1.66666pt minus 1.11111pt for Computer Modern 10pt, if I am correct). If it is possible in LuaTeX, then it can be pushed further towards user-friendliness: the percentage of the possible amount shrunk or expanded could be printed… and coloured (it’s LuaTeX, after all!). UPDATE: but obviously LuaTeX uses different fonts and metrics, and such a solution would not help LaTeX typesetters, who, as I roughly estimate, make up a large majority of TeX users, and that proportion is not likely to waver.
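In fact, the raw data is not hard to reach from Lua. Here is a minimal LuaLaTeX sketch of the idea (untested and only meant to show the shape of a solution; the callback description string is my own invention): after line breaking, every line is an hlist node whose glue_set, glue_sign and glue_order fields describe how its glue was set, and they can be dumped to the log:

\documentclass{article}
\usepackage{luacode}
\begin{luacode*}
-- log the glue-set ratio of every line the paragraph builder produces
luatexbase.add_to_callback("post_linebreak_filter", function (head)
  for line in node.traverse_id(node.id("hlist"), head) do
    if line.glue_order == 0 then -- finite interword glue only
      texio.write_nl(string.format("glue set to %.3f of the available %s",
        line.glue_set, line.glue_sign == 2 and "shrink" or "stretch"))
    end
  end
  return head
end, "report-glue-set")
\end{luacode*}
\begin{document}
A short test paragraph, long enough to be broken into several lines, so that
the callback above has something to report in the log file.
\end{document}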

If the microtype package is enabled, the same question arises: can we obtain the stretch/shrink parameter values for each line of output? If the default limit is 20, then a value of 20 or −19 in a line in which a word break occurs may indicate that no hyphenation point could be found, so the line had to resort to extreme expansion/compression.

Although the transition to full-size hyphenation dictionaries may be the most beautiful option in the long run (assuming that the complexity of the hyphenation look-up does not exceed, say, O(n·log n), where n is some measure of the dictionary size; a sketch of that route follows), all I want for now is to ascertain whether it is possible to print/store the badness of each line and/or the exact amount of glue added/removed.
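For completeness, the dictionary route itself is technically trivial: it amounts to loading a huge pre-generated exception list (the file name below is hypothetical), and since TeX keeps \hyphenation exceptions in a hash table, per-word look-up cost is not the bottleneck (capacity, governed by hyph_size in web2c-based engines, is):

% ru-hyph-exceptions.tex (hypothetical) would hold one huge block like
% \hyphenation{ми-н-да-лём ра-с-по-ря-ди-те-лю мо-ж-но ...}
\input{ru-hyph-exceptions}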

Desired result in one sentence: review occurrences of \badness close to critical, or of the amount of glue added close to the maximum allowed (the concept is shown in the figures below).

(Figure: reporting the amount of glue added.)
(Figure: reporting badness higher than a threshold.)

(This is an approximate model of what could become the new quality criterion for LaTeX output.)

What can you advise?

UPDATE

I have reproduced a bothersome example in which a manual \hyphenation of a word drastically improved the paragraph layout.

Minimal working example:

\documentclass[10pt]{memoir}
\usepackage[T2A]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[russian]{babel} % Enable Russian hyphenation
\usepackage{microtype} % See how even microtype fails
\righthyphenmin=2 % Russian language rules
\def\psk{\hskip1.5em\relax} % Parboxes and all that hard-coded stuff merely serve
% the illustrative aim of reproducing the example precisely

\begin{document}
\parbox[t]{226.15pt}{\psk И~он показывал какую-то странную позу, несколько
запрокинувшись назад, как бы полупадая от «истомлённости».}
% The badness is very close to 1000, and you see how bad it is

\parbox[t]{226.2pt}{\psk И~он показывал какую-то странную позу, несколько
запрокинувшись назад, как бы полупадая от «истомлённости».}
% Now the badness is over 1000

\parbox[t]{226.15pt}{\psk И~он показывал какую-то странную позу, не\-с\-ко\-ль\-ко
запрокинувшись назад, как бы полупадая от «истомлённости».}
% Since there must be a hyphen anyway, this breaking is much more beautiful now!
% (And such hyphenation is perfectly legitimate.)
\end{document}

(Figure: badness close to 1000 due to a missed hyphenation point.)

Underfull \hbox (badness 1009) in paragraph at lines 15--15
[] \T2A/cmr/m/n/10 (+20) И он по-ка-зы-вал какую-то стран-ную по-зу,

This is what I was talking about: there must be one hyphen in the paragraph, and neither layout 1 nor layout 3 is reported as bad, but the manually adjusted 3 is more beautiful. Of course one can run the document multiple times with \textwidth ranging, for instance, from 220 to 250 pt in steps of 5, and manually amend all those ugly lines by providing all possible breakpoints, but… you know… LaTeX documents are not meant to be improved by hard-coding, r-right?

Nota bene: if \parboxes are used, then unfound hyphenation causes underfulls. If the same width is passed as a parameter to the geometry package and the text is typeset as normal paragraphs, unfound hyphenation causes overfulls. Both are odious, though.

P.S. I am aware of the article by Mr. Paul Isambert (http://tug.org/TUGboat/tb31-3/tb99isambert.pdf) that introduces a Lua(La)TeX way to look at the evenness of the page grey. Besides, the chickenize package provides the \colorstretch macro that blindly evaluates everything. However, I do not think highly of Lua(La)TeX’s robustness and stability (with respect to input), since there are so many things to manually detect and recode with hand-kludged typography tools (thin spaces, thin nbsp’s, initial spaces—holy cow, there is no way to dispose of good old babel) in Unicode, in place of nice and decent LaTeX macros! Just to illustrate that there is a not-so-robust solution which may or may not be reimplemented in LaTeX, please see the following example (polyglossia’s hyphenation goes to smash, too):

\documentclass[10pt,oneside]{memoir}
\usepackage{fontspec}
\usepackage{polyglossia}
\usepackage{microtype} % See how even microtype fails
\righthyphenmin=2 % Russian language rules
\setmainfont{Liberation Serif}
\setdefaultlanguage{russian}
\setlength{\parindent}{1.5em}
\usepackage[textwidth=200.2pt]{geometry}
\usepackage{chickenize}

\begin{document}
\colorstretch
И~он показывал какую-то странную позу, несколько
запрокинувшись назад, как бы полупадая от «истомлённости».

И~он показывал какую-то странную позу, не\-с\-ко\-ль\-ко
запрокинувшись назад, как бы полупадая от «истомлённости».
\end{document}

(Figure: LuaLaTeX’s chickenize output.)

(Compiled on Linux Mint Debian without any additional fonts installed.) Well, this does not reproduce the exact spacing and goodness of the handcrafted paragraph, but it gives a slight idea of what I desire to see in LaTeX—a means of detecting possible inferior breaking caused by a missed hyphenation point.

Best Answer

On one hand, TeX usually knows better than most people, including me. On the other hand, I have always been annoyed by its reticence to actually display what's going on with line fullness, primarily because I find numbers in messages like Overfull \hbox (7.79364pt too wide) to be barely informative.

That's one of the reasons I was amazed discovering ConTeXt diagnostics and LuaTeX engine integration. But! There had to be a traditional TeX solution. That's what I came up with:

(Screenshot: KAPOW — the marked-up output described below.)

Let me break it down.

  • The number on the left is the line badness.
  • The box on the left is the line grayness. It is obtained by linearly mapping the badness range [0;100] to the gray ranges [#808080;#FFFFFF] and [#808080;#000000] for stretched and shrunk lines respectively (see the formula after this list). The box is blue if the line is underfull and red if the line is overfull.
  • The bar on the right gives the line deformation. The border between orange and azure is placed at the line’s natural length. The orange (azure) box is the available shrink (stretch) range. There is a white hairline at the actual line width. If the line is underfull (overfull), there is a blue (red) bar giving the missing (excess) length, crossing the boundaries of the allowed line deformations.
  • The number on the right is the shrink (stretch) amount if negative (positive). When the line is underfull (overfull), the number is the missing (excess) length, i.e. the length of the blue (red) bar.
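In formula form (this is what \AssessBox below computes, storing the value in hundredths), the gray level g of a line with badness b ≤ 100 is

\[ g \;=\; \tfrac{1}{2} \pm \tfrac{b}{200}, \qquad 0 \le b \le 100, \]

with “+” for stretched and “−” for shrunk lines, so that b = 0 gives #808080 and b = 100 gives pure white or black.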

The core of the routines for a single line box assessment is written in plain TeX; I present to you badger.tex:

% default rules abolished
\overfullrule=0pt

% ========================================================== GLUE STRAINING ====
% This is a procedure to detect glue finiteness and (if finite)
% quantify the total amount inside a box. Based on
% https://tex.stackexchange.com/a/191844/82186 by Bruno Le Floch

% to avoid choking on warnings
\hbadness=1000000
\hfuzz=\maxdimen

\newdimen\StrainedGlue
\newbox\tmp

\def\StrainGlue#1#2#3#4{
  \begingroup
    \dimen0 = -\maxdimen
    \dimen1 =  \maxdimen
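    % binary search: \dimen0 and \dimen1 bracket the total finite stretch
    % or shrink in the box; the trial cancellation \dimen2 is bisected until
    % the box can no longer absorb the 1sp spread with badness below 100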
    \loop
      \dimen2 = \dimen0
      \advance \dimen2 by \dimen1
      \divide \dimen2 by 2
      \ifdim \dimen2 = \dimen1
        \advance \dimen2 by -1sp \fi
      \setbox\tmp = \hbox spread #2 1sp {%
        \unhcopy#4\hskip 0pt #3 -\dimen2}
      \ifnum \badness > 100
        \dimen1 = \dimen2
      \else
        \dimen0 = \dimen2
        \advance \dimen0 by 1sp \fi
    \ifdim \dimen0 < \dimen1
      \repeat
    \global\StrainedGlue\dimen0
  \endgroup
}

% ============================================================ BOX ANALYSIS ====

% stashes for the data
\newdimen\FittedWidth
\newdimen\NaturalWidth
\newdimen\Deformation
\newdimen\MaxStretch
\newdimen\MaxShrink
\newdimen\OverStretch
\newdimen\OverShrink
\newcount\LineBadness
\newcount\Grayness

\def\AssessBox#1{
  % re-set a copy at the fitted (target) width; \badness then reports
  % the badness of the line as it was actually typeset
  \setbox0 = \hbox to \wd#1 {\unhcopy#1}
  \FittedWidth = \wd0
  \LineBadness = \badness
  % re-set another copy at its natural width to measure the undeformed length
  \setbox0 = \hbox {\unhcopy#1}
  \NaturalWidth = \wd0
  \StrainGlue{shrink}{-}{minus}{0} \MaxShrink = \StrainedGlue
  \StrainGlue{stretch}{}{plus}{0}  \MaxStretch = \StrainedGlue
  \Deformation = \dimexpr\FittedWidth-\NaturalWidth\relax
  \OverStretch = \dimexpr\Deformation-\MaxStretch\relax
  \OverShrink = \dimexpr-\Deformation-\MaxShrink\relax
  \definecolor{Grayness}{rgb}{0,1,0}% fallback, always overwritten below
  \ifnum \LineBadness = 1000000
    \definecolor{Grayness}{rgb}{1,0,0}% overfull: red
  \else\ifnum \LineBadness < 100
    \Grayness = \numexpr50\ifdim\Deformation>0pt+\else-\fi\LineBadness/2\relax
    \ifnum\Grayness>99 \Grayness=99 \fi% clamp: "0.100" would be read as 0.1
    % pad to two digits, so that e.g. 5 yields gray 0.05 rather than 0.5
    \definecolor{Grayness}{gray}{0.\ifnum\Grayness<10 0\fi\the\Grayness}
  \else
    \definecolor{Grayness}{rgb}{0,0,1}% underfull: blue
  \fi\fi
}

% ========================================================= MARKERS DRAWING ====

\def\marker#1#2%
  {{\color{#1}\vrule width #2 height \ht\strutbox depth \dp\strutbox}}

\definecolor{OverShrink} {rgb}{1.0,0.0,0.0}
\definecolor{Shrink}     {rgb}{1.0,0.5,0.0}
\definecolor{Stretch}    {rgb}{0.0,0.5,1.0}
\definecolor{OverStretch}{rgb}{0.0,0.0,1.0}

% \tenthpt adapted from Michael J. Downes' showdim package
\def\tenthextract#1.#2#3\relax{#1\ifnum#2=0 \else.#2\fi}
\def\tenthpt#1{\dimen0#1\relax
  \advance\dimen0\ifdim\dimen0<0pt-\fi.05pt
  \expandafter\tenthextract\the\dimen0\relax pt}

\def\BadnessMarkers{%
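  % left margin: the badness number and the grayness swatch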
  \llap{\smash{%
    \ifnum\LineBadness=1000000\relax$\infty$\else\the\LineBadness\fi%
    ~\marker{Grayness}{\baselineskip}}}%
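  % right margin: deformation bar and measure (skipped when glue is infinite)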
  \ifnum\MaxStretch=\maxdimen\else\ifnum\MaxShrink=\maxdimen\else%
    \rlap{\hskip\hsize\smash{%
      \ifnum\LineBadness=1000000\relax%
        \rlap{%
          \marker{OverShrink}\OverShrink%
          \marker{Shrink}\MaxShrink%
          \marker{Stretch}\MaxStretch}%
        \llap{\tenthpt\OverShrink\hskip-4\baselineskip}%
      \else\ifnum\LineBadness>100\relax%
        \llap{%
          \marker{Shrink}\MaxShrink%
          \marker{Stretch}\MaxStretch%
          \marker{OverStretch}\OverStretch}%
        \llap{\tenthpt\OverStretch\hskip-4\baselineskip}%
      \else%
        \llap{%
          \llap{\marker{Shrink}\MaxShrink}%
          \rlap{\marker{Stretch}\MaxStretch}%
          \hskip\Deformation\relax}%
        \marker{white}{1sp}%
        \llap{\tenthpt\Deformation\hskip-4\baselineskip}%
      \fi\fi}}%
  \fi\fi%
}

Now, about the paragraph analysis. There are various ways to apply the routines depending on your situation.

If we are talking plain TeX then we can do the boldest thing and alter the output routine. This is particularly nice because it's the easiest way to handle interline and interparagraph glue. Here is plain.tex, an example to toy with:

\input miniltx
\input color.sty

\input badger

% code adapted from a TeX pearl by Paweł Jackowski (Custom overfull text)
% http://www.tug.org/TUGboat/tb29-1/tb91pearls.pdf

\interlinepenalty=-50000 % force the break between each two lines
\maxdeadcycles=50        % allow up to 50 \outputs with no \shipout
\newtoks\orioutput \orioutput=\output % wrap the original \output routine
\output
    {\ifnum\outputpenalty>-20000 \the\orioutput
     \else \ifnum\outputpenalty<-\maxdimen \the\orioutput
     \else
     \unvbox255        % flush the entire list back
     \setbox0=\lastbox % strip the very last box
     \nointerlineskip  % avoid doubled interline glue
     \AssessBox0
     \hbox to \FittedWidth{\BadnessMarkers\unhbox0}
     \advance\outputpenalty by 50000
     \penalty\outputpenalty % weak lie that nothing happened...
     \fi\fi}

\hsize=2in
\input knuth
\bye

I wouldn't dare do that with LaTeX, though. However, a topical application of the analysis routine is possible too. Here is la.tex, the example that produces the screenshot above:

\documentclass{article}
\usepackage{color}

\input badger

% code inspired by an answer by David Carlisle
% https://tex.stackexchange.com/a/56853/82186

\newskip\savedskip
\newcount\savedpenalty
\newdimen\olddepth
\newbox\linebox
\newbox\parabox

\def\eat{
  \loop
    % strip the bottom line, plus the glue and penalty above it
    \setbox\linebox\lastbox
    \savedskip\lastskip\unskip
    \savedpenalty\lastpenalty\unpenalty
    \ifvoid\linebox\else
      \AssessBox\linebox
      \setbox0=\hbox to \hsize{\BadnessMarkers\unhcopy\linebox}
      % push the decorated line back on top of the lines eaten so far
      \global\setbox\parabox\vbox{%
        \penalty\savedpenalty
        \vskip\savedskip
        \box0
        \unvbox\parabox}%
  \repeat}

\def\dissect#1\par{%
 \olddepth\prevdepth% remember \prevdepth so interline glue is computed as usual
 \setbox0\vbox{\hbox{\vrule depth\olddepth}\par#1\par\eat}%
 \unvbox\parabox}

\begin{document}\vfil
{\bfseries A beautiful paragraph follows.}\par\vfil
\dissect\input zapf\par\vfil
{\bfseries Here comes an acceptable one.}\par\vfil
{\hsize=3in\dissect\input zapf\par}\vfil
{\bfseries Now the ugly.}\par\vfil
{\hsize=2in\dissect\input zapf\par}\vfil
\end{document}

That's just one of the possible approaches. One could use recursive macro calls instead of looping. Or maybe use \everypar to apply the marks globally: that would be interesting!

Keep in mind that the examples I wrote are a bit rough; I particularly dislike the imperfect handling of interparagraph spacing in the second one. However, the analysis is sound, and it seems to me that most line-level data is clearly readable this way.

I will try to fix the spacing issues someday, so that this can be used worry-free as an on/off drafting tool.