[Tex/LaTex] How to define the badness of a river

Tags: line-breaking, paragraphs, rivers, typography

I've written an algorithm to try and detect rivers in paragraphs and it actually detects quite a lot when I run it. Some of them are clearly false positives, but there are others that are indeed aligned spaces on consecutive lines. Here are some, colored in green in the following picture:

[Figure: a collection of rivers]

When are rivers really problematic and/or ugly? Are there rivers in this example that are worth fixing?

What are the parameters (and their importance) to qualify the "badness" of a river, and how could they be calculated?

As an additional question, there doesn't seem to be a standard definition of a river. Defining a river properly would surely help to define the parameters that make it bad. How would you define a river?

Best Answer

After reading lockstep's and Lev's answers, here is my own take. It seems to me that there are four main factors that make a river bad:

Its orientation: the straighter, the worse;
Its width: the larger, the worse;
The constancy of its width: the more constant, the worse;
Its length: the longer, the worse.

From this, I guess I could try to improve the algorithm by checking the following:

Find overlapping spaces (which I already do) on as many lines as possible (instead of just 3);
Approximate the river by a linear regression and retrieve a regression factor that measures how well a straight line fits;
Measure the width of each node, and calculate the mean μ and the standard deviation σ of the widths;
Based on all this, calculate the badness from:
- the regression factor (the closer to a straight line, the worse), which might be some kind of mean squared error (MSE),
- the mean width μ divided by the standard space width ω (the larger compared to the standard word space, the worse),
- the standard deviation of the width σ (the smaller, the worse),
- the length (number of lines n; the longer, the worse).
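The measurements listed above can be sketched as follows. This is a minimal illustration in Python rather than the Lua code of the actual package. The function name `river_stats`, the `(x_center, width)` node representation and the choice to regress the horizontal position on the line index are all assumptions made for this example.

```python
from statistics import mean, pstdev

def river_stats(nodes, line_height=1.0):
    """Summarise a candidate river.

    nodes: list of (x_center, width) tuples, one per consecutive line,
           ordered from the top line to the bottom line (hypothetical format).
    Returns n, the mean width mu, the width deviation sigma and the
    mean squared error of a straight-line fit through the space centers.
    """
    n = len(nodes)
    ys = [i * line_height for i in range(n)]   # vertical position of each line
    xs = [x for x, _ in nodes]                 # horizontal center of each space
    widths = [w for _, w in nodes]

    # Least-squares fit of x as a function of y. Rivers run roughly
    # vertically, so regressing x on y avoids infinite slopes.
    y_bar, x_bar = mean(ys), mean(xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((y - y_bar) * (x - x_bar) for y, x in zip(ys, xs))
    slope = sxy / syy if syy else 0.0
    intercept = x_bar - slope * y_bar

    residuals = [(x - (intercept + slope * y)) ** 2 for x, y in zip(xs, ys)]
    return {
        "n": n,
        "mu": mean(widths),
        "sigma": pstdev(widths),   # population standard deviation
        "mse": mean(residuals),
    }
```

A perfectly straight diagonal of equal-width spaces yields `sigma == 0` and an MSE of (numerically) zero, while a zig-zag of the same spaces keeps `sigma == 0` but gets a clearly positive MSE.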

The badness could be something like (with a factor α to normalize it):

[Formula image: suggested formula]

where

α = 10000 ω

to set the maximum badness to 10000.

I suggest squaring σ and MSE since they appear to be more important factors than μ and n.

With this formula, we would have:

b tends towards 10000 when μ tends towards infinity (maximum badness for very large spaces);
b tends towards 10000 when n tends towards infinity (maximum badness for a lot of lines);
b tends towards 0 when μ tends towards 0 (smaller spaces reduce badness);
b tends towards 0 when n tends towards 0 (fewer lines reduce badness);
b tends towards 10000 when σ tends towards 0 (monospaced text increases badness);
b tends towards 10000 when MSE tends towards 0 (perfectly aligned spaces are sure to be really bad);
b tends towards 0 when σ tends towards infinity (different spaces tend to reduce badness);
b tends towards 0 when MSE tends towards infinity (unaligned spaces do not lead to real rivers).
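The formula itself survives only as an image in this copy. Purely as an assumed reconstruction, and not necessarily the expression from that image, one form that reproduces all eight limits listed above (with α and ω already cancelled against each other) is:

```latex
\[
  b \;=\; 10000 \cdot \frac{n\mu}{\,n\mu + (\sigma \cdot \mathrm{MSE})^{2}\,}
\]
% n\mu \to \infty                    =>  b \to 10000
% n\mu \to 0                         =>  b \to 0      (for \sigma\cdot\mathrm{MSE} \neq 0)
% \sigma \to 0 or \mathrm{MSE} \to 0 =>  b \to 10000
% \sigma \to \infty or \mathrm{MSE} \to \infty  =>  b \to 0
```

In this assumed form, nμ and the squared product σ·MSE compete in the denominator: wide or long rivers push b towards the maximum, while irregular or poorly aligned spaces pull it towards zero.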

My definition of a river would then be:

an accidental series of aligned spaces of constant width on 3 or more consecutive lines.

Edit: As Bruno noted, α and ω are not really used in the calculation since we fix the maximum badness to 10000 anyway. Also the algorithm can be simplified by not calculating μ since nμ is simply the sum of all widths:

[Formula image: simplified badness formula expressed in terms of S]

with:

[Formula image: definition of S, the sum of the widths of all spaces in the river, so that S = nμ]

Edit 2: I'm actually considering using something like S + σ + MSE in the denominator instead of S * (σ * MSE)^2. The reasons for that are:

When σ is zero (perfectly identical spaces), that doesn't make the river necessarily bad, it still depends on MSE (the alignment);
When MSE is zero (perfectly aligned spaces), that doesn't make the river necessarily bad, it still depends on the size of spaces;
I'm not sure squares are necessary for σ and MSE (but experiments will have to tell) since they're already squared differences.
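The difference between the two denominators can be illustrated with a small Python sketch. The function names are hypothetical. The squared-product variant is assumed here to be S + (σ·MSE)², because a literal S * (σ * MSE)^2 would grow without bound as σ tends to 0 instead of approaching the stated maximum of 10000. Treat both functions as guesses at the intended formulas, not as the package's implementation.

```python
def badness_product(widths, sigma, mse, max_badness=10000):
    """Assumed original variant: S / (S + (sigma * mse)^2), scaled to max_badness."""
    s = sum(widths)                      # S: sum of all space widths (= n * mu)
    denom = s + (sigma * mse) ** 2
    return max_badness * s / denom if denom else 0.0

def badness_sum(widths, sigma, mse, max_badness=10000):
    """Variant from Edit 2: S / (S + sigma + mse), scaled to max_badness."""
    s = sum(widths)
    denom = s + sigma + mse
    return max_badness * s / denom if denom else 0.0
```

With identical spaces (σ = 0) that are badly aligned, the product variant still reports the maximum badness because the product σ·MSE vanishes, whereas the additive variant keeps the MSE term and reports a lower value. That is the behaviour the first two reasons above ask for.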

As a little progress note, here is Lev's excellent example converted to LuaTeX + fontspec:

\documentclass{article}
\usepackage{fontspec}
\setmainfont[Ligatures=TeX]{Minion Pro}
\usepackage{microtype}
\usepackage[draft,rivers]{impnattypo}
\begin{document}
\noindent\parbox{8.5cm}{\hspace{15pt}
eget niis non lobero at conseyquat lacus. Vestibulum eg
Lorem ipsum dolor sit amew jonsectetun ad PL Wilson elit
Pellentesque nec turpis nisv Ac lobortis ballacus.  Ut fringil
nis, non ipsum gravida sep `doltrices' odio dictub.  Tam id l
fermintum dolor. Pail NT cabitant morbi istiqleith  vibendi
senectus et netus erepaw dalesuada fames ic turpak  wegest
Nam ac nunc vel nique.  aliquam dictum etat magna.  Thats
risus neque. `Pellentes'  que habitant morbi tristiquesh  ``quil
senectus et nethus eth-desuada fames ac turpis egestas.}
\end{document}

and here is what my current algorithm makes of it:

[Figure: a bad detection]

It detects quite a few things... except the 2 mighty rivers, which don't actually overlap on 3 consecutive lines... There's still quite some work to do...

As for the overlapping issue, it seems to me that the larger the interline space, the further apart horizontally two spaces can be and still belong to the same river. If lines are very close, spaces really have to overlap in order to create a river, but if lines are very loose, then spaces that are actually distant horizontally can create a diagonal river, too.

Update: I decided that a river deviates by less than 45° from the vertical, in which case the overlap can be tested with a tolerance of plus or minus the line height. So the new algorithm no longer requires spaces to overlap strictly vertically: the overlap may be off by up to the distance between the two lines, in either direction. The result with Lev's example is this:

[Figure: allowing + or - line height in overlap]
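The relaxed overlap test can be sketched like this, in Python for illustration only (the real code runs in Lua inside LuaTeX). The function name `may_connect` and the `(left, right)` representation of a space are assumptions for this example.

```python
def may_connect(upper, lower, line_distance=0.0):
    """Decide whether two spaces on consecutive lines can belong to one river.

    upper, lower: (left, right) horizontal extents of a space on the upper
                  and on the lower line (hypothetical representation).
    line_distance: vertical distance between the two lines. The upper space
                   is widened by this amount on both sides, which is one way
                   to read the "+ or - line height" rule for rivers that
                   stay within 45 degrees of the vertical.
    With line_distance=0 this reduces to the strict vertical overlap test.
    """
    u_left, u_right = upper
    l_left, l_right = lower
    return l_left <= u_right + line_distance and l_right >= u_left - line_distance
```

For instance, spaces spanning 10 to 13 and 14 to 17 do not overlap strictly, but they connect once a 12pt line distance is allowed. A space starting at 30 would still be rejected.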

The next step will be to analyze more than 3 lines (I still only look at 3) and to define and apply a river badness to eliminate false positives. This seems to be a bit harder since I have to define a list object in Lua to chain the nodes that are part of the river, but I'm slowly getting there.
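One possible way to chain spaces over an arbitrary number of lines is sketched below. It is written in Python rather than Lua, and the data layout (a list of lines, each a list of `(left, right)` spaces) and the name `find_rivers` are assumptions for the example, not the package's data structures.

```python
def find_rivers(lines, line_distance=0.0, min_lines=3):
    """Chain connectable spaces on consecutive lines into candidate rivers.

    lines: list of lines, each a list of (left, right) space extents.
    Returns a list of chains; each chain is a list of (line_index, space).
    Only chains spanning at least min_lines lines are reported.
    """
    def connects(a, b):
        # Same tolerance rule as a +/- line-distance overlap test.
        return b[0] <= a[1] + line_distance and b[1] >= a[0] - line_distance

    open_chains = []   # chains whose last space sits on the previous line
    finished = []
    for i, spaces in enumerate(lines):
        next_open, extended = [], set()
        for space in spaces:
            parents = [c for c in open_chains if connects(c[-1][1], space)]
            if parents:
                best = max(parents, key=len)      # continue the longest chain
                next_open.append(best + [(i, space)])
                extended.add(id(best))
            else:
                next_open.append([(i, space)])    # start a new chain here
        # Chains that nothing continued end on the previous line.
        finished.extend(c for c in open_chains
                        if id(c) not in extended and len(c) >= min_lines)
        open_chains = next_open
    finished.extend(c for c in open_chains if len(c) >= min_lines)
    return finished
```

Each chain returned this way could then be scored with a badness measure, and chains below some threshold discarded as false positives. Continuing only the longest matching chain is a simplification chosen to keep the sketch short.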

Related Solutions

[Tex/LaTex] What’s the difference between \tolerance and \badness

The \tolerance setting influences the paragraph breaking routine itself: changes to \tolerance (and \pretolerance) actually affect which line breaks are chosen. Higher values allow worse lines (usually meaning: with stretched inter-word spaces) to be accepted, with the value 10000 indicating a 'panic mode' where anything at all is acceptable. Normally the lower the value, the better the paragraph will look, but you run the risk of reducing the list of possible breaks so much that you end up with overfull lines.

The \hbadness setting only influences the user report (the messages you see on screen and in the log) about the actually chosen lines, it has no effect on the breaking routine itself.

[Tex/LaTex] Repetition of a word on two lines

One of Don Knuth's recommendations for fixing various typographical issues is to rewrite the passage in question – assuming that doing so is possible and/or permissible, of course. (The passage you cite is one case where you mustn't change a single word, obviously.) If you can't/mustn't rewrite the passage, you can still try to change some parameters such as the line width, font size, interword spacing, and occasionally impose a tie (unbreakable space), all in order to try to mitigate the problem.

Addendum: I've succeeded in reproducing the OP's text fragment in the following MWE:

\documentclass{article}
\usepackage[french]{babel}
\usepackage{kpfonts}
\begin{document}

\begin{minipage}{1.7in}
Je suis venu non pour juger le monde, mais pour sauver le monde. Celui qui me rejette
et qui ne re\c coit pas mes
\end{minipage}

\bigskip
\begin{minipage}{1.7in}
Je suis venu non pour juger~le monde, mais pour sauver le monde. Celui qui me rejette
et qui ne re\c coit pas mes
\end{minipage}

\bigskip
\begin{minipage}{1.6in}
Je suis venu non pour juger le monde, mais pour sauver le monde. Celui qui me rejette
et qui ne re\c coit pas mes
\end{minipage}

\bigskip
\begin{minipage}{1.8in}
Je suis venu non pour juger le monde, mais pour sauver le monde. Celui qui me rejette
et qui ne re\c coit pas mes
\end{minipage}
\end{document}

The first minipage reproduces the initial problem. In example two, I've inserted a tie between "juger" and "le": this forces a hyphenation of the word "juger" and succeeds in breaking up the repetition, at the cost of loose word spacing (given the narrow measure!). The third example does not impose a tie but shortens the measure, also breaking up the vertical word repetition a bit but also suffering from loose word spacing (especially in line 3). The fourth example widens the measure a bit; now lines 2 and 3 both start with "monde" (as opposed to "le monde" in the first example), and the interword spacing looks OK overall. A slight improvement, maybe, but really only very slight. I guess the problem is particularly vexing because the repeated group contains two words rather than just one!
