[TeX/LaTeX] How to define the badness of a river


I've written an algorithm to try and detect rivers in paragraphs and it actually detects quite a lot when I run it. Some of them are clearly false positives, but there are others that are indeed aligned spaces on consecutive lines. Here are some, colored in green in the following picture:

a collection of rivers

When are rivers really problematic and/or ugly? Are there rivers in this example that are worth fixing?

What are the parameters (and their importance) to qualify the "badness" of a river, and how could they be calculated?

As an additional question, there doesn't seem to be a standard definition of a river. Defining a river properly would surely help to define the parameters that make it bad. How would you define a river?

Best Answer

After reading lockstep's and Lev's answers, here is my own take. It seems to me that there are four main factors that make a river bad:

  1. Its orientation: the straighter, the worse;
  2. Its width: the larger, the worse;
  3. The constancy of its width: the more constant, the worse;
  4. Its length: the longer, the worse.

From this, I guess I could try to improve the algorithm by checking the following:

  1. Find overlapping spaces (which I already do) on as many lines as possible (instead of just 3);
  2. Try to approximate the river by a linear regression and retrieve a regression factor;
  3. Measure the width of each node, and calculate the mean μ and standard deviation σ;
  4. From all this, calculate the badness based on:
    • the regression factor (the closer to a straight line, the worse), which might be some kind of MSE,
    • the mean width μ divided by the standard space width ω (the larger compared to the standard word space, the worse),
    • the standard deviation of width σ (the smaller, the worse),
    • the length (number of lines n, the longer, the worse).
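The measurements listed above could be sketched as follows. This is only a minimal illustration in Python, not the Lua code I actually run: the function name `river_stats` and the input format (one `(x_center, width)` pair per consecutive line) are assumptions made for the example, and the "regression factor" is taken to be the mean squared error of a least-squares line through the space centers.

```python
from statistics import fmean, pstdev

def river_stats(spaces):
    """spaces: list of (x_center, width), one entry per consecutive line (hypothetical format)."""
    n = len(spaces)
    if n < 2:
        raise ValueError("need spaces on at least two lines")
    xs = [x for x, _ in spaces]
    widths = [w for _, w in spaces]
    idx = list(range(n))  # the line index is the independent variable

    # Least-squares fit x = slope * line_index + intercept
    mean_i = fmean(idx)
    mean_x = fmean(xs)
    var_i = sum((i - mean_i) ** 2 for i in idx)
    slope = sum((i - mean_i) * (x - mean_x) for i, x in zip(idx, xs)) / var_i
    intercept = mean_x - slope * mean_i

    # Mean squared horizontal distance between each space and the fitted line
    mse = fmean([(x - (slope * i + intercept)) ** 2 for i, x in zip(idx, xs)])

    return {
        "n": n,                    # number of lines
        "mu": fmean(widths),       # mean width
        "sigma": pstdev(widths),   # population standard deviation of the widths
        "mse": mse,
        "slope": slope,
    }

stats = river_stats([(10.0, 3.0), (12.0, 3.0), (14.0, 3.0), (16.0, 3.0)])
```

In this example the spaces lie exactly on a diagonal line and all have the same width, so both `mse` and `sigma` come out as zero, which is the worst case according to the factors above.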

The badness could be something like (with a factor α to normalize it):

suggested formula


with α = 10000 ω, to set the maximum badness to 10000.

I suggest squaring σ and MSE, since they appear to be more important factors than μ and n.

With this formula, we would have:

  • b tends towards 10000 when μ tends towards infinity (maximum badness for very large spaces);
  • b tends towards 10000 when n tends towards infinity (maximum badness for a lot of lines);
  • b tends towards 0 when μ tends towards 0 (smaller spaces reduce badness);
  • b tends towards 0 when n tends towards 0 (fewer lines reduce badness);
  • b tends towards 10000 when σ tends towards 0 (monospaced text increases badness);
  • b tends towards 10000 when MSE tends towards 0 (perfectly aligned spaces are sure to be really bad);
  • b tends towards 0 when σ tends towards infinity (different spaces tend to reduce badness);
  • b tends towards 0 when MSE tends towards infinity (unaligned spaces do not lead to real rivers).
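To check these limits numerically, here is a small Python sketch. The exact expression is an assumption: it is one simple form that satisfies all eight limits above, with μ·n in the numerator and (σ·MSE)² added in the denominator, and it may differ in detail from the formula I suggested. The name `river_badness` is made up for the example.

```python
MAX_BADNESS = 10000.0

def river_badness(mu, n, sigma, mse):
    """Assumed form: b = 10000 * mu*n / (mu*n + (sigma*mse)^2)."""
    size = mu * n                   # grows with both width and number of lines
    penalty = (sigma * mse) ** 2    # grows when widths vary or spaces are misaligned
    if size + penalty == 0:
        return 0.0                  # no space at all: treat as no river
    return MAX_BADNESS * size / (size + penalty)

aligned = river_badness(mu=3.0, n=5, sigma=1.0, mse=0.0)       # MSE -> 0
monospaced = river_badness(mu=3.0, n=5, sigma=0.0, mse=1.0)    # sigma -> 0
tiny = river_badness(mu=0.0, n=5, sigma=1.0, mse=1.0)          # mu -> 0
irregular = river_badness(mu=3.0, n=5, sigma=1e6, mse=1.0)     # sigma -> infinity
huge = river_badness(mu=1e12, n=5, sigma=1.0, mse=1.0)         # mu -> infinity
```

With this form, a single zero in either σ or MSE is enough to push the badness to its maximum, regardless of the other values.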

My definition of a river would then be:

an accidental series of aligned spaces of constant width on 3 or more consecutive lines.

Edit: As Bruno noted, α and ω are not really used in the calculation, since we fix the maximum badness to 10000 anyway. The algorithm can also be simplified by not calculating μ, since μ·n is simply the sum S of all widths.

Edit 2: I'm actually considering using something like S + σ + MSE in the denominator instead of S * (σ * MSE)^2. The reasons for that are:

  • When σ is zero (perfectly identical spaces), that doesn't make the river necessarily bad, it still depends on MSE (the alignment);
  • When MSE is zero (perfectly aligned spaces), that doesn't make the river necessarily bad, it still depends on the size of spaces;
  • I'm not sure squares are necessary for σ and MSE (but experiments will have to tell) since they're already squared differences.
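A quick Python sketch of this additive variant, to see how it behaves in the two cases above. It assumes that the numerator stays S and that the maximum stays 10000; the function name `river_badness_additive` is hypothetical.

```python
def river_badness_additive(total_width, sigma, mse, max_badness=10000.0):
    """Assumed form: b = 10000 * S / (S + sigma + MSE), with S the sum of all widths."""
    denom = total_width + sigma + mse
    if denom == 0:
        return 0.0  # no space at all: treat as no river
    return max_badness * total_width / denom

identical_but_misaligned = river_badness_additive(15.0, sigma=0.0, mse=15.0)
aligned_but_irregular = river_badness_additive(15.0, sigma=15.0, mse=0.0)
perfect = river_badness_additive(15.0, sigma=0.0, mse=0.0)
```

Here a zero σ or a zero MSE alone no longer gives the maximum badness; only the combination does. Note that S and σ are lengths while MSE is a squared length, so the three terms would probably need weights before being added, which is something experiments will have to settle.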

As a little progress note, here is Lev's excellent example converted to LuaTeX + fontspec:

\setmainfont[Ligatures=TeX]{Minion Pro}
eget niis non lobero at conseyquat lacus. Vestibulum eg
Lorem ipsum dolor sit amew jonsectetun ad PL Wilson elit
Pellentesque nec turpis nisv Ac lobortis ballacus.  Ut fringil
nis, non ipsum gravida sep `doltrices' odio dictub.  Tam id l
fermintum dolor. Pail NT cabitant morbi istiqleith  vibendi
senectus et netus erepaw dalesuada fames ic turpak  wegest
Nam ac nunc vel nique.  aliquam dictum etat magna.  Thats
risus neque. `Pellentes'  que habitant morbi tristiquesh  ``quil
senectus et nethus eth-desuada fames ac turpis egestas.}

and run through my current algorithm:

a bad detection

It detects quite a few things, except the two mighty rivers, whose spaces don't actually overlap on 3 consecutive lines. There's still quite some work to do.

As for the overlapping issue, it seems to me that the bigger the interline space, the larger the horizontal gap between two spaces can be. If lines are very close, spaces really have to overlap in order to create a river, but if lines are very loose, then spaces that are actually distant horizontally can create a diagonal river, too.

Update: I assumed that rivers make an angle of less than 45° with the vertical. In that case, two spaces on adjacent lines no longer have to overlap strictly: they can be offset horizontally by up to the distance between the two lines. The result with Lev's example is this:
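The relaxed overlap test can be sketched like this in Python. It is an illustration rather than my actual Lua code: spaces are assumed to be given as `(left, right)` horizontal extents, and `spaces_connect` is a made-up name.

```python
def spaces_connect(upper, lower, line_distance):
    """upper, lower: (left, right) extents of two spaces on adjacent lines."""
    u_left, u_right = upper
    l_left, l_right = lower
    # Horizontal gap between the two intervals; zero or negative means strict overlap
    gap = max(u_left, l_left) - min(u_right, l_right)
    # A gap of at most one line distance keeps the river within 45 degrees of the vertical
    return gap <= line_distance

strict = spaces_connect((10, 13), (12, 15), line_distance=0)    # True: real overlap
too_far = spaces_connect((10, 13), (15, 18), line_distance=0)   # False: 2 units apart
relaxed = spaces_connect((10, 13), (15, 18), line_distance=12)  # True with the tolerance
```

Setting `line_distance` to 0 gives back the strict vertical overlap used before the update.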

Allowing + or - line height in overlap

The next step will be to analyze more than 3 lines (I currently still stop at 3), and to define and apply a river badness to eliminate false positives. This seems to be a bit harder, since I have to define a list object in Lua to chain the nodes that are part of the river, but I'm slowly getting there.
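To give an idea of what chaining over more than 3 lines might look like, here is a rough Python sketch (the real thing would be a Lua list of nodes). Everything in it is an assumption for illustration: the input is a list of lines, each a list of `(left, right)` space extents, and for every space it keeps the longest chain of connected spaces that ends there.

```python
def longest_rivers(lines, line_distance, min_lines=3):
    """Return chains of connected spaces spanning at least min_lines lines."""
    def connect(a, b):
        # Same relaxed overlap test: horizontal gap of at most one line distance
        return max(a[0], b[0]) - min(a[1], b[1]) <= line_distance

    # best[i][j] = (chain length, index of predecessor on line i-1) for space j on line i
    best = []
    for i, spaces in enumerate(lines):
        row = []
        for space in spaces:
            length, prev = 1, None
            if i > 0:
                for k, above in enumerate(lines[i - 1]):
                    if connect(above, space) and best[i - 1][k][0] + 1 > length:
                        length, prev = best[i - 1][k][0] + 1, k
            row.append((length, prev))
        best.append(row)

    rivers, used = [], set()
    # Walk from the last line upwards so each chain is reported from its lowest space
    for i in range(len(lines) - 1, -1, -1):
        for j, (length, _) in enumerate(best[i]):
            if length < min_lines or (i, j) in used:
                continue
            chain, ci, cj = [], i, j
            while cj is not None:
                used.add((ci, cj))
                chain.append((ci, lines[ci][cj]))
                cj = best[ci][cj][1]
                ci -= 1
            rivers.append(chain[::-1])  # top-to-bottom order
    return rivers

example = [
    [(10, 13), (40, 43)],
    [(12, 15)],
    [(14, 17), (80, 83)],
    [(16, 19)],
]
rivers = longest_rivers(example, line_distance=0)
```

In this example only one chain is found, running through all four lines; the isolated spaces at 40 and 80 are ignored. Each chain could then be passed to a badness function to discard the false positives.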