[Tex/LaTex] TeX’s algorithms for “breaking paragraph into lines” and “setting the glue”

tex-core

I want to learn more about the implementation of TeX's text-justification algorithm (setting the glue).

After scanning through ch.12 of Knuth's TeXbook, I'm guessing that setting the glue must consist in a constrained linear program

whose variables represent the interword glue, and
whose constraints correspond to shrinks & stretches allowed for each interword glue.

However, if my intuition is correct, the linear program in question must get prohibitively large pretty quickly as the length of the document increases, at least, for computers back in the day. Does setting the glue use some heuristic instead?

EDIT: Thanks to egreg for his answer. However, I realise I should have included the paragraph-making algorithm in my question.

I'm still not quite clear on the interplay between

the paragraph-making algorithm (which, I presume, determines where line breaks occur), and
the text-justification algorithm (which determines the interword glue).

It seems to me that the two problems are not independent. Does the paragraph-making algorithm, in the process of generating line break after line break, not use information about each glue item (natural width, stretch and shrink components), too?

Best Answer

There's no linear programming involved.

TeX stores items in lists (vertical, horizontal or math lists). Math lists are converted to horizontal lists, so only vertical and horizontal lists need to be discussed, and the two behave alike as far as setting the glue is concerned.

Glue is stored as a glue item, represented with its natural width, stretch component and shrink component.

When TeX has to build a box of a given width from a horizontal list, or of a given height from a vertical list (lines in a paragraph or a page, for instance), it sums the natural widths of all the glue items, adds the widths of the non-glue items, and computes how much the glue has to be stretched or shrunk.

If a stretch of s is required and a total stretch of t is available, TeX computes the "stretch ratio" r = s/t; a glue item with stretch component x is then set to its natural width plus rx.
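
As a small illustration of this ratio, here is a minimal plain TeX sketch. The box register, rule width and glue values are arbitrary choices made for the example, not taken from the answer. The box is asked to be 10pt wider than its natural width, and the two glue items offer 10pt and 20pt of stretch, so r = 10/30 and they should grow by about 3.33pt and 6.67pt respectively.

```tex
% Natural width: 20pt + 50pt + 20pt = 90pt; requested width: 100pt.
% Needed stretch s = 10pt, available stretch t = 10pt + 20pt = 30pt, so r = s/t = 1/3.
\setbox0\hbox to 100pt{%
  \hskip 20pt plus 10pt   % stretches by r * 10pt
  \vrule width 50pt       % non-glue item, its width is fixed
  \hskip 20pt plus 20pt   % stretches by r * 20pt
}
% Show the whole box, on the terminal as well as in the log.
\tracingonline=1 \showboxbreadth=100 \showboxdepth=100
\showbox0  % the first line should report a glue set ratio close to 0.33333
\bye
```

Note that \showbox interrupts the run like \show does, so press return to continue when running interactively.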

The situation is similar for shrinking, with the difference that TeX never shrinks a glue item by more than its stated shrink component; if the total available shrink is not sufficient, it outputs an overfull box.
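
The shrinking limit can be observed with a sketch along the same lines (again with made-up dimensions). The material is 10pt too wide for the box but the glue may only give up 5pt, so TeX should shrink it fully and then complain about an overfull box.

```tex
% Natural width: 50pt + 40pt = 90pt; requested width: 80pt.
% Needed shrink: 10pt; available shrink: only 5pt.
\setbox0\hbox to 80pt{\vrule width 50pt \hskip 40pt minus 5pt}
% TeX is expected to warn that the \hbox is overfull (about 5pt too wide).
% After an overfull box, \badness holds the special value 1000000.
\message{badness: \the\badness}
\bye
```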

There is a small complication because stretching or shrinking can be expressed in "infinite units" (fil, fill, filll). When computing the total stretch or shrink available, TeX keeps only the highest order of infinity that occurs, and only the glue items whose specification matches that order are stretched (or shrunk) beyond their natural width.
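
A minimal sketch of this rule, with illustrative values only: the box below contains one glue item with finite stretch and one with fil stretch. Since fil is the highest order present, the finite stretch should be ignored and the fil glue should absorb all of the missing 20pt.

```tex
% Natural width: 20pt + 10pt + 20pt + 10pt + 20pt = 80pt; requested width: 100pt.
\setbox0\hbox to 100pt{%
  \vrule width 20pt
  \hskip 10pt plus 5pt     % finite stretch: stays at 10pt because fil is present
  \vrule width 20pt
  \hskip 10pt plus 1fil    % highest order of infinity: takes the whole 20pt
  \vrule width 20pt
}
\tracingonline=1 \showboxbreadth=100 \showboxdepth=100
\showbox0  % the glue set value should carry the unit "fil"
\bye
```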

These are very easy and fast operations on any computer.

The paragraph making algorithm takes into account stretchability and shrinkability of the glue items in the horizontal list it has built only in an indirect way.

Glue items mark feasible line break points, provided they are preceded by a non-discardable item (also other items mark feasible breaks). During this phase TeX looks at the natural widths of the items and, to put it simply, chooses a sequence of break points such that no line will exceed the current \hsize.

Actually this choice is based on many parameters, on penalties found in the list (or implicit penalties) and on computed demerits. Possible break points are also discretionary hyphens that may have been automatically inserted during the process.

For each feasible break point, TeX computes the glue ratio of the line that would result; this assigns the line a badness, which is used in computing the total demerits. However, the glue is not set at this point, but only once TeX has chosen the best sequence of break points (best according to its rules and the values of the various parameters involved): the lines are put in hboxes of width \hsize, essentially \hbox to \hsize{...}, and only now is the glue set as explained above.
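
The link between glue ratio and badness can be checked directly, because TeX stores the badness of the most recently built box in \badness. The TeXbook describes the badness as approximately 100 times the cube of the glue ratio. The following sketch uses arbitrary dimensions and explicit \hbox to constructions rather than the paragraph builder, so it only illustrates the per-line computation, not the choice of break points.

```tex
% Natural width 90pt, requested 100pt: s = 10pt, t = 20pt, so r = 1/2.
\setbox0\hbox to 100pt{\vrule width 70pt \hskip 20pt plus 20pt}
% 100 * (1/2)^3 = 12.5, so the value printed should be close to that.
\message{badness: \the\badness}
% Same box with only 5pt of stretch: r = 2, so roughly 100 * 2^3 = 800.
\setbox0\hbox to 100pt{\vrule width 70pt \hskip 20pt plus 5pt}
\message{badness: \the\badness}
\bye
```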

Related Solutions

[Tex/LaTex] How to change \hsize in the middle of a paragraph at pagebreak, especially for two-column

I don't know about using Lua, but I suspect that what you want is not possible with pdfTeX. As Joseph points out, a number of parameters cannot be changed in the middle of a paragraph. The \hsize that is in effect at the end of the paragraph is the one that gets used by TeX's paragraph builder.
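
A minimal plain TeX sketch of this behaviour (the widths and the text are made up for the example): \hsize is assigned twice within one paragraph, and only the value in force at \par should affect the line breaks.

```tex
% Both assignments happen inside one paragraph; only the last one matters.
\hsize=300pt
\noindent This paragraph begins while the line width is 300pt, and a few
more words are added so that it runs over several lines.
\hsize=150pt
It ends while the line width is 150pt, so all of its lines, including the
first ones, are broken to 150pt.\par
\bye
```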

As for column breaks, TeX accumulates vertical material in its "main vertical list" until it has enough to fill a page and then it runs the output routine. (That's a simplification; see The TeXbook or TeX by Topic for full details.) What this means here is that the paragraph has already been broken up into boxes of width \hsize and added to the vertical list (along with some glue between them to keep the baselines \baselineskip apart). So by the time TeX gets around to deciding on the break point for the first column, material that will appear in the final column has already been typeset.

Edit:
In the comments, Hendrik asks about abusing the output routine and unboxing. Even ignoring the difficulty of finding exactly which material needs to be unboxed, I can't think of a good way to deal with hyphens:

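% Typeset a narrow paragraph so that the word is broken at the discretionary.
\setbox0\vbox{
        \hsize40pt
        \rightskip=5pt
        \parindent=0pt
        hyphen\-ation
}
% Take the two lines apart again and rejoin them in a single hbox (box 1).
\setbox0\vbox{
        \unvbox0
        % last line goes into box 1
        \global\setbox1\lastbox
        % remove the \baselineskip glue and the penalty between the lines
        \unskip
        \unpenalty
        % first line goes into box 2
        \setbox2\lastbox
        \global\setbox1\hbox{%
                \unhbox2
                % drop \rightskip, then add an interword space
                \unskip
                \
                \unhbox1
                % drop \rightskip, \parfillskip and the \penalty10000
                \unskip
                \unskip
                \unpenalty
        }
}
\unhbox1
\bye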

Here we set box 0 to have known contents.

\vbox(18.94444+0.0)x40.0
.\hbox(6.94444+1.94444)x40.0
..\hbox(0.0+0.0)x0.0
..\tenrm h
..\kern-0.27779
..\tenrm y
..\tenrm p
..\tenrm h
..\tenrm e
..\tenrm n
..\discretionary
..\tenrm -
..\glue(\rightskip) 5.0
.\penalty 400
.\glue(\baselineskip) 3.37697
.\hbox(6.67859+0.0)x40.0, glue set 12.77771fil
..\tenrm a
..\tenrm t
..\tenrm i
..\tenrm o
..\tenrm n
..\penalty 10000
..\glue(\parfillskip) 0.0 plus 1.0fil
..\glue(\rightskip) 5.0

Since we know that it has two lines, we can directly pull it apart and stuff the contents into box 1.

\hbox(6.94444+1.94444)x60.5557
.\hbox(0.0+0.0)x0.0
.\tenrm h
.\kern-0.27779
.\tenrm y
.\tenrm p
.\tenrm h
.\tenrm e
.\tenrm n
.\discretionary
.\tenrm -
.\glue 3.33333 plus 1.66666 minus 1.11111
.\tenrm a
.\tenrm t
.\tenrm i
.\tenrm o
.\tenrm n

Whereas this would have worked had the word not been hyphenated, here the hyphen inserted at the break has become an ordinary character in the list, so we get
"hyphen- ation"

TeX Line-Breaking Without Hyphenation – Is It Worthwhile?

It does not have any serious impact on performance on modern machines, and I can vouch for old machines as well. Depending on your settings, more than 50% of text would normally get through on the first pass.

[Figure: two test paragraphs; the red numbers denote badness.]

The tests were carried out using code posted by Wilson on Git. Personally, I would recommend letting \pretolerance stay at 100; it will probably be faster (as you do not force the other passes in the majority of cases).
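
For reference, the passes are controlled by two parameters. The fragment below is only a sketch of the settings discussed here; the values shown are the plain TeX defaults.

```tex
% First pass: no hyphenation; accepted if every line has badness <= \pretolerance.
\pretolerance=100
% Second pass: hyphenation is tried; lines may have badness up to \tolerance.
\tolerance=200
% A negative value would skip the first pass and always hyphenate:
% \pretolerance=-1
```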

\documentclass{article}
\usepackage{xcolor}
%%% Code from GIT posted by Wilson

\frenchspacing
\fussy

\makeatletter
\newbox\trialbox
\newbox\linebox
\newcount\maxbad
\newcount\linebad
\newcount\bestbad
\newcount\worstbad
\newcount\overfulls
\newcount\currenthbadness

\def\trypar#1\par{%
  \showtrybox{\linewidth}{#1\par}%
}

\newcommand\showtrybox[2]{%
  \currenthbadness=\hbadness
  \maxbad=0\relax
  \setbox\trialbox=\vbox{%
    \hsize#1\relax#2%
    \hbadness=10000000\relax
    \eatlines
  }%
  \hbadness=10000000\relax
  \setbox\trialbox=\vbox{%
    \hsize#1\relax#2%
    \printlines
  }%
  \noindent\usebox\trialbox\par
  \hbadness=\currenthbadness
}

\newcommand\trybox[2]{%
  \currenthbadness=\hbadness
  \maxbad=0\relax
  \setbox\trialbox=\vbox{%
    \hsize#1\relax#2\par
    \hbadness=10000000\relax
    \eatlines
  }%
  \hbadness=\currenthbadness
}

\def\eatlines{%
  \begingroup
  \setbox\linebox=\lastbox
  \setbox0=\hbox to \hsize{\unhcopy\linebox\hss}%
  \linebad=\the\badness\relax
  \ifnum\linebad>\maxbad\relax \global\maxbad=\linebad\relax \fi
  \ifvoid\linebox\else
    \unskip\unpenalty\eatlines
  \fi
  \endgroup
}

\def\printlines{%
  \begingroup
  \setbox\linebox=\lastbox
  \setbox0=\hbox to \hsize{\unhcopy\linebox}%
  \linebad=\the\badness\relax
  \ifvoid\linebox\else
    \unskip\unpenalty\printlines
    \ifhmode\newline\fi\noindent\box\linebox\showbadness
  \fi
  \endgroup
}

\def\showbadness{%
  \makebox[0pt][l]{%
    \ifnum\currenthbadness<\linebad\relax
      \ifnum\linebad=1000000\relax\expandafter\@gobble\fi
      {\quad\color{red}\rule{\overfullrule}{\overfullrule}~{\footnotesize\sffamily(\the\linebad)}}%
    \fi
  }%
}

\makeatother

\begin{document}

\hbadness=-1 
\begin{minipage}[t]{4.5cm}
\trypar\hyphenpenalty=500\looseness=1
In olden times when wishing
still helped one, there lived a
king whose daughters were all
beautiful, but the youngest was so
beautiful that the sun itself,
which has seen so much, was
astonished whenever it shone in
her face. Close by the king's
castle lay a great dark forest,
and under an old lime-tree in the
forest was a well, and when
the day was very warm, the
king's child went out into the 
forest and sat down by the side
of the cool fountain, and when she was bored she
took a golden ball, and threw it up on a high and caught it, and this
ball was her favorite plaything. \par
\end{minipage}
\hspace{2cm}
\begin{minipage}[t]{4.5cm}
\trypar\hyphenpenalty=10000\looseness=1
In olden times when wishing
still helped one, there lived a
king whose daughters were all
beautiful, but the youngest was so
beautiful that the sun itself,
which has seen so much, was
astonished whenever it shone in
her face. Close by the king's
castle lay a great dark forest,
and under an old lime-tree in the
forest was a well, and when
the day was very warm, the
king's child went out into the 
forest and sat down by the side
of the cool fountain, and when she was bored she
took a golden ball, and threw it up on a high and caught it, and this
ball was her favorite plaything. \par
\end{minipage}
\end{document}
