[Tex/LaTex] unexpected interaction between \raggedright and \parshape

horizontal alignmentline-breakingparagraphstex-core

While answering Random start of line, I came upon a strange interaction between \raggedright and \parshape.

If a ragged-right paragraph ends in a short line, this line will be "the most ragged".

\documentclass{article}
\begin{document}
\raggedright
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
\end{document}

The effect of <code>\raggedright</code> in LaTeX

Now watch this: I add a dummy \parshape (it does nothing, really) and all of a sudden, it's the first line that is short!

\documentclass{article}
\begin{document}
\raggedright
\parshape 4
0cm \textwidth
0cm \textwidth
0cm \textwidth
0cm \textwidth
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
\end{document}

The joint effect of <code>\raggedright</code> and <code>\parshape</code>

Let's explore further. The resulting paragraph is three lines long. What happens if the \parshape specification consists of three items? We're back to normal!

\documentclass{article}
\begin{document}
\raggedright
\parshape 3
0cm \textwidth
0cm \textwidth
0cm \textwidth
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
\end{document}

With shorter <code>\parshape</code>, normality

In general, we get the unexpected result if and only if the \parshape specification has at least one item more than the resulting paragraph. I guess that's why all this is hard to notice: usually, the \parshape is of a zero length.

Exploring even further, I have discovered that plain TeX behaves ok. (To facilitate comparison, I have set plain TeX's text width to equal the LaTeX default.)

\def\is{%
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i}
\hsize 345pt

\parskip 2ex
\raggedright
\is

\parshape 3
0cm \hsize
0cm \hsize
0cm \hsize
\is

\parshape 4
0cm \hsize
0cm \hsize
0cm \hsize
0cm \hsize
\is
\bye

The plain TeX's result

The difference between plain TeX and LaTeX suggests that LaTeX's definition of \raggedright is to blame.

TeX:

> \raggedright=macro:
->\rightskip \z@ plus2em \spaceskip .3333em \xspaceskip .5em\relax .

LaTeX:

> \raggedright=macro:
->\let \\\@centercr \@rightskip \@flushglue \rightskip \@rightskip \leftskip \z@skip \parindent \z@ .

Indeed, LaTeX's definition sets \rightskip to \@flushglue, which is 0.0pt plus 1.0fil, and thus probably responsable the described problem. Defining (in plain TeX)

\def\raggedright{\rightskip 0pt plus1.0fil \spaceskip .3333em \xspaceskip .5em\relax}

makes plain TeX behave like LaTeX. (Almost: for a perfect match (in this case) write \def\raggedright{\parindent 0pt\rightskip 0pt plus1.0fil\relax}.)

I have also checked this with the original (non-pdf, non-e) TeX. The behaviour is the same. Btw, I use TeX version 3.1415926-2.4-1.40.13 (TeX Live 2012/Arch Linux).

My question (finally) is actually manifold:

  1. How does this happen? I.e. I'd like to have the details of how TeX computes the paragraph shape resulting in a short first line.

  2. How is this even possible? I have read all I could find on \parshape in TeXbook and TeX by Topic, and searched the net, but nowhere does it say that \parshape could have a different effect if the last item is repeated (one time too many). Is this a bug in TeX??!? … … I guess not: I just don't believe I'm lucky enough to stumble upon one and gain both fortune and eternal glory in a single strike :-)))

  3. What, precisely, were the desired effects of plain TeX's and LaTeX's \raggedright? Obviously to have a ragged right edge 🙂 Anything else?

    3a. How do the various components of the definition contribute to achieving the goal? (A separate question for plain TeX and LaTeX.)

By the way, the problem that this issue had caused to the OP of Random start of line was resolved by redefining his \raggedright to a plain TeX-like version. (So my question here was an academic one.)

EDIT: At first I thought that LaTeX's ragged right edge looked nicer than plain TeX's one. But then Stephan's comment made me see that I just haven't looked at the right examples. So I withdraw that opinion, and I have deleted all the references to it from the question.

UPDATE

Looking at the TeX source, I've figured out two more ways to trigger the phenomenon: a combination of \raggedright and

  • hanging indents if abs(\hangafter) >= number of lines in the paragraph (given the \parshape problem, this is not really unexpected)

  • non-zero \looseness, as illustrated below. In fact, the greater the (positive) looseness, the more horrible the result. 🙂

    \documentclass{article}
    \begin{document}
    \parindent 0pt
    \looseness=1
    \raggedright
    i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
    i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
    i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i 
    i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i
    \end{document}
    

\raggedright and non-zero \looseness

If I'm not much mistaken, \parshape, \looseness and hanging indents should be a complete list.

Best Answer

While I'm very grateful to Stephan and Joseph for their answers (and in particular to Joseph for info about the LaTeX3 team's discovery of the same problem), I must admit to an unsatisfiable curiosity: I had to dig further. So I delved into the TeX source ...

Theory

Let me reiterate what is for me the crucial part of my question. I was perplexed by the fact that different ways of conveying the same instruction (e.g. either by writing \parshape 3 0cm \hsize 0cm \hsize 0cm \hsize or nothing) can have different effects.

But now I know! TeX's line-breaking algorithm can operate in two modes: I'll call them easy and hard. Obviously, it is the hard mode that is computationally more expensive, thus the easy mode is used whenever possible. But, the first big subquestion, at least for me (a complete newby in the theory of line-breaking), was: why does TeX need the hard mode at all? What cannot be done in the easy mode?

Easy and hard: why?

First take a look at the upper left paragraph. Notice that, on the left, words multilingual typesetting are typeset quite badly. Could they be typeset any better if word TeX's was somehow pushed one line down? Obviously not (see the upper right paragraph), since all the following lines are of the same length.

The situation is radically different when not all of the lines following the breakpoint TeX's are of equal length. The gaping gap between multilingual and typesetting in (lower left) disappears once the words are pushed one line down (lower right)!

why hard mode

\documentclass{article}
\usepackage{fullpage}
\pagestyle{empty}

% text taken from Wikipedia's article 'TeX'
\def\history{However, since the source code of \TeX\ is essentially in the public domain\paren{}, other programmers are allowed\paren{ (and explicitly encouraged)} to improve the system, but are required to use another name to distribute the modified \TeX, meaning that the source code can still evolve. For example, the Omega project was developed after 1991, primarily to enhance \fbox{\TeX's}\break \emph{multilingual typesetting} abilities. Donald Knuth himself created "unofficial" modified versions, such as \TeX-XeT, which allows a user to mix texts written in left-to-right and right-to-left writing systems in the same document.}  \def\paren#1{}
\def\pushdown{\let\paren\relax}

\def\Long{0pt \hsize}
\def\Short{0pt 130pt}
\def\parshapeP{\parshape 8 \Long\Long\Long\Long\Long\Long\Long\Short}
\def\parshapeC{\parshape 9 \Long\Long\Long\Long\Long\Long\Long\Short\Long}

\begin{document}
\tracingparagraphs=1
\hyphenpenalty=10000
\parindent 0pt
\parskip 3ex

\begin{minipage}[t]{222pt}\parshapeP\history\end{minipage}
\quad\quad
\begin{minipage}[t]{222pt}\parshapeP\pushdown\history\end{minipage}

\begin{minipage}[t]{222pt}\parshapeC\history\end{minipage}
\quad\quad
\begin{minipage}[t]{222pt}\parshapeC\pushdown\history\end{minipage}

\end{document}

(Obviously, TeX doesn't push a breakpoint down by adding some text to the paragraph as I did. (That might be Word:-) TeX works with flexible interword spacing and if the flexibility becomes one line long ... voila, two possibilities for one position! By the way, while TeX's default spacing might be increased by setting \tolerance/\spaceskip/... (or using LaTeX's \raggedright!), one line of extra space is easily found in the real world as well: just write a long and/or narrow paragraph!)

TeX's line-breaking algorithm operates by travelling through the paragraph from the beginning to the end, remembering along the way which positions might become breakpoints. Assume that TeX has realized that the current position could break more than one line. If all lines following the current position are of equal length, TeX knows that its decision about this matter won't influence the quality of the subsequent line-breaks: thus it can take the easy way, remembering only the option that makes the preceding part of the paragraph look best (here TeX can make a perfect decision since it has already travelled through it) and forgetting about other possibilities. However, if not all lines following the current position are of equal length, TeX has no choice but to go the hard way: it must remember all the possibilities.

The line-level: everything's easy and right

TeX's line-breaking algorithm is actually one big pruning machine. The dumbest way to calculate the best sequence of breakpoints in the paragraph would be to list all the possibilities and evaluate each of them; however, in a paragraph containing n legitimate breakpoints, this would yield a staggering 2^n sequences. So, TeX eliminates possibilities as early as possible. We have seen above how TeX goes into the "easy" mode whenever possible: easy mode is less costly because it prunes more.

On the "line-level", easy and hard mode behave the same. Consider the word modified at the end of the fourth line, upper left paragraph. When TeX was evaluating the possibility of breaking the line after this word, it had two options (see the log excerpt, @@6): whether to create a line starting after word required (@via @@4) or after to (@via @@5). Similarly to the easy mode described above, the decision cannot possibly influence the line breaks at the subsequent positions, therefore the locally better of the two options is selected (@@4, the one yielding less demerits to the beginning of the paragraph) and the other one forgotten.

al-lowed to im-prove the sys-tem, but are re-quired 
@ via @@3 b=99 p=0 d=11881
@@4: line 3.1 t=1759534 -> @@3
to 
@ via @@3 b=0 p=0 d=10100
@@5: line 3.2 t=1757753 -> @@3
use an-other name to dis-tribute the mod-i-fied 
@ via @@5 b=2376 p=0 d=5702996
@ via @@4 b=285 p=0 d=87025
@@6: line 4.0 t=1846559 -> @@4

A seemingly innocuous question: what happens if two (or more) possibilities yield the same demerits? The answer, directly from the source: the rightmost one is selected.

The reason, in detail. When TeX evaluates whether the current position could become a breakpoint, it does so by traversing (from left to right) the list of active nodes. (Active nodes are TeX's representation of potential breakpoints --- breakpoints which precede the current position and were deemed feasible. Each active node contains a link to some other active node (which precedes it in the text). Each active node is thus a head of a chain of breakpoints leading back to the beginning of the paragraph.) At each node, TeX first checks whether the current position is a feasible breakpoint with respect to the current active node, i.e. whether the badness of the line from the current active node to the current position is within the user-given tolerance. If so, TeX will then compare the total demerits (first approximation: the sum of badnesses of individual lines) of the current chain to the minimum total demerits found at the current position so far. If the current total demerits are less than or equal to the minimum total demerits, the break is recorded (the previous candidate overwritten).

In short: the rightmost option is chosen due to left-to-right processing and less than or equal to comparison.

(A knowledgeable reader has certainly noticed that I have disregarded the fitness classification. This was intentional, as it is not relevant to the original problem. Actually, there's a lot of things I am disregarding...)

Active nodes: line by line

We're getting close to the heart of the problem.

How do the active nodes get into the (active) list? Here, the two modes are radically different. In the easy mode, the line numbers (of active nodes) are irrelevant, so TeX creates at most one active node per current position: it links it to the (rightmost) active node yielding least total demerits of all the active nodes. The new active node is then appended to the active list. This is good, as it preserves the linear order; in other words, the order of the active nodes in the active list matches the relative order of the corresponding positions.

In the hard mode the situation is somewhat more complex. TeX must group the active nodes by line-number classes, i.e. for each line number, it must select the best active node with that line number. The problem is, how can this be done efficiently? (Think '80, not python!)

Knuth's solution was quite ingenious. (Although, as we shall see, it has backfired a bit.) The active list is sorted by line numbers. Thus, all the active nodes with a given line number can be found in a single block. This makes it trivial both to compute the total demerits minima with respect to line-number classes and to insert new active nodes in the appropriate positions. Let's see how. Assume that TeX is currently traversing the line-number class n block of the active list, updating the minimum demerits along the way. When TeX hits an active node from the next class (line number >n) (or the end of the active list), (i) TeX has already seen all the active nodes of class n, so the computation of the minimum demerits for class n is done: the new active node can be created; and (ii) the position (immediately after line class n) is ideal for the insertion of the new active node, whose line-number is n+1: the sorting order (by line numbers) is respected by inserting the new node just before the current active node (which is the first node of class >n).

Obviously, the consequence of this procedure is that the new active node will be inserted before the other nodes of its class (if there are any in the active list). However, the new active node's position in the paragraph is the current position, which follows the positions of the "old" active nodes. Thus, the hard mode will reverse the order of active nodes within each line-number class (for an example, see the order of @ via @@4 and @ via @@5 in the log excerpt above).

In effect, the answer to the innocuous question was actually incorrect, or at least imprecise. True, it is the rightmost node that wins, but rightmost in the linear order of the active list, which is not necessarily rightmost in the sense of the position in the paragraph. (Here, right should read as "closer to the end".)

LaTeX's \raggedright

Understanding the problem with LaTeX's \raggedright is now trivial. It defines a \rightskip of infinite strechability. Therefore, all lines can shrink without penalty. Therefore, any sequence of breakpoints is equally good. (More precisely: any sequence yielding the same number of lines, under the usual positive setting of \linepenalty. Normally, a sequence with the least lines will be selected.) Therefore, the processing order plays a major role in the selection of the breakpoint chain.

Practice

Is this a bug?

The question is what should happen in the case where there is more than one optimal solution to the line-breaking problem, i.e. if the minimum demerits are achieved by more than one sequence of breakpoints.

As far as I know (and I know [1,2,3]), Knuth does not mention this situation and thus makes no promise about the outcome. And we really shouldn't blame him for that. Think about it: how could one even describe which of the equivalent breakpoint sequences will be selected?!? I don't see any way aside of asking the user to study the algorithm...

So, on one side, no promise no bug.

However, users (and especially users of TeX, we're spoiled!;-) expect consistent behaviour of their programs. The point is, I'm positive that anyone (who hasn't studied TeX's source) would think of the easy mode simply as a special case of the hard mode: "Oh, I don't have to write \parshape=1 0cm \hsize before every paragraph! How convenient!!!". Or, how could anyone expect \parshape=1 x y not to have precisely the same effect as \parshape=2 x y x y, when it is clearly stated in the TeXbook, that the specifications of \parshapes last line will be "repeated ad infinitum"? (A Knuth-style exercise: why can the last two \parshapes have different effects?)

In this sense, this is certainly a bug.

avaniTeX

We're in the "Practice" section, but I actually did my programming exclusively out of academic interest --- as a proof of concept. See, I couldn't have written this answer with (almost) absolute conviction if I wasn't able to fix the "problem", i.e. unify the behaviour of the easy and the hard mode. I have thus made a patch and even named it --- I wouldn't want to violate TeX's licence! It is called, in a show of a complete lack of imagination, avaniTeX (avani is a nickname I used to use on the web). You can download it from my webpage. And here's how it works.

One diabolical idea was to "fix" the easy mode. (Diabolical since it would of course corrupt all flushleft environments in LaTeX :-) The patch consisted of a single instruction: replace a certain <= by a < :-). However, this doesn't really work since I have later discovered that the active nodes are not simply reversed, but reversed by the line number class.

The correct approach is to fix the insertion of the active nodes: we want to insert a new node of class n+1 after all the other nodes of class n+1 --- this will keep the active list order in sync with the paragraph order. In principle that's actually quite simple to implement (without any serious computational overhead). Instead of inserting the new node directly to the active list, it is rather appended to the waiting list, which is inserted into the active list when the line-number class n+1 ends. This is fine since the waiting list is populated precisely in the moment of the transition between the classes. (By the way, the waiting list must be a list because it potentially needs to store the new active nodes for more than one fitness.)

However, dealing correctly(?) with delta nodes was a nightmare.

Does it work? More or less ... I had no problem recompiling my documents with it. The TRIP test (very useful, it found many of my mistakes) fails at the second run (the first couple of "pages" of trip.typ are ok, but the rest is garbage). ---Just before posting this, I have discovered that a problem occurs in the emergency pass when the active list gets empty: that's where TeX typesets overfull boxes. I'm not going to investigate this at the moment: Knuth warns about that particular piece of code that "readers who seek to "improve" TeX should therefore think thrice before daring to make any changes here." :-)

As a final note, did I mention that I hate Pascal? :-))) But I do love WEB, and have become a great fan of literate programming and I promise to document my code better from now on. Seriously, reading Knuth's code is a joy. Try it out sometime.

  1. The TeXbook.
  2. Donald E. Knuth and Michael F. Plass (1981). Breaking Paragraphs into Lines. Software---Practice and Experience, vol 11, 1119--1184.
  3. TeX82: the documented source.