I've written an algorithm to try and detect rivers in paragraphs and it actually detects quite a lot when I run it. Some of them are clearly false positives, but there are others that are indeed aligned spaces on consecutive lines. Here are some, colored in green in the following picture:
When are rivers really problematic and/or ugly? Are there rivers in this example that are worth fixing?
What are the parameters (and their importance) to qualify the "badness" of a river, and how could they be calculated?
As an additional question, there doesn't seem to be a standard definition of a river. Defining a river properly would surely help to define the parameters that make it bad. How would you define a river?
Best Answer
After reading lockstep's and Lev's answers, here is my own take. It seems to me that there's 4 main factors that make a river bad:
From this, I guess I could try to improve the algorithm by checking the following:
The badness could be something like (with a factor α to normalize it):
where
to set the maximum badness to 10000.
I suggest to square σ and MSE since they appear to be more important factors than μ and n.
With this formula, we would have:
My definition of a river would then be:
Edit: As Bruno noted, α and ω are not really used in the calculation since we fix the maximum badness to 10000 anyway. Also the algorithm can be simplified by not calculating μ since nμ is simply the sum of all widths:
with:
Edit 2: I'm actually considering to use something like
S+ σ + MSE
in the denominator instead ofS * (σ * MSE)^2
. The reasons for that are:As a little progress note, here is Lev's excellent example converted to LuaTeX + fontspec:
and ran into my current algorithm:
It detects quite a few things... except the 2 mighty rivers, which don't actually overlap on 3 consecutive lines... There's still quite some work to do...
As for the overlapping issue, it seems to me that the bigger the interline space, the more space between spaces is possible. If lines are very close, spaces really have to overlap in order to create a river, but if lines are very loose, then spaces that are actually distant horizontally can create a diagonal river, too.
Update: I considered that rivers are below 45° (with a vertical line), and in this case, the overlap can be taken + or - the line height. So the new algorithm considers that spaces do not necessarily have to overlap strictly vertically, but the overlap can be + or - the distance between the two lines. The result with Lev's example is this:
Next step will be to analyze on more than 3 lines (as I still do) and define and apply a river badness to eliminate false positive rivers. This seems to be a bit harder since I have to define a list object in Lua to chain the nodes that are part of the river, but I'm slowly getting there.