Solved – Interpreting ROUGE scores

rouge, text-summarization

I recently read the paper on Salesforce's advances in abstractive text summarisation, which states that its ROUGE-1 score of 41.16 is significantly better than the previous state of the art.

I also read this paper on (mainly extractive) text summarisation techniques. It compares the ROUGE-1 scores of various summarisation methods, which range from 0.3788 to 0.5190.

I assume that the scores cited above are just using different scales, but even so I am finding it hard to get a clear understanding of how ROUGE works. This SO question says that ROUGE measures recall, but that is contradicted by this post that covers both precision and recall.

I can understand that a higher score shows improvement over previous scores. I can also understand that abstractive text summarisation is harder than extractive text summarisation. Presumably, as a researcher you would always be trying to get a better score than the previous techniques. But as a user of these methods I need to gauge how far I can rely on the algorithms and how far I need to use humans to do some post-processing on the summarisations.

So my question has two sides to it:

  1. What is the best way to really understand what a ROUGE score actually measures?
  2. How "good" is a particular absolute ROUGE score? I'm defining "good" as "minimises the need for human post-processing".

Best Answer

As a user of these methods I need to gauge how far I can rely on the algorithms and how far I need to use humans to do some post-processing on the summarisations.

How "good" is a particular absolute ROUGE score? I'm defining "good" as "minimises the need for human post-processing".

There are two aspects that may impact the need for human post-processing:

  • Does the summary sound fluent?
  • Is the summary adequate? I.e., is the length appropriate, and does it cover the most important information of the text it summarizes?

ROUGE doesn't try to assess how fluent the summary is: it only tries to assess adequacy, by simply counting how many n-grams in your generated summary match the n-grams in your reference summary (or summaries, as ROUGE supports multi-reference corpora).

From https://en.wikipedia.org/w/index.php?title=Automatic_summarization&oldid=808057887#Document_summarization:

If there are multiple references, the ROUGE-1 scores are averaged. Because ROUGE is based only on content overlap, it can determine if the same general concepts are discussed between an automatic summary and a reference summary, but it cannot determine if the result is coherent or the sentences flow together in a sensible manner. High-order n-gram ROUGE measures try to judge fluency to some degree. Note that ROUGE is similar to the BLEU measure for machine translation, but BLEU is precision-based, because translation systems favor accuracy.

Note that BLEU has the same issue, as you can see on these correlation plots, taken from {1}:

[Figure: correlation plots from {1}]
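To make the n-gram counting and the averaging over multiple references concrete, here is a minimal Python sketch of ROUGE-1 recall. It is my own illustration, assuming bare whitespace tokenization and clipped unigram counts; the function names are hypothetical and the numbers will not exactly match the official ROUGE toolkit.

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Clipped unigram overlap divided by the number of unigrams in the reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((cand_counts & ref_counts).values())  # min count per shared unigram
    return overlap / max(sum(ref_counts.values()), 1)

def rouge1_recall_multi(candidate: str, references: list[str]) -> float:
    """Average the per-reference scores, as described in the excerpt above."""
    return sum(rouge1_recall(candidate, r) for r in references) / len(references)

print(rouge1_recall_multi(
    "the cat sat on the mat",
    ["the cat was on the mat", "a cat sat on a mat"],
))  # ~0.75: the candidate recovers most, but not all, reference unigrams
```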

What is the best way to really understand what a ROUGE score actually measures?

In short and approximately:

  • ROUGE-n recall=40% means that 40% of the n-grams in the reference summary are also present in the generated summary.
  • ROUGE-n precision=40% means that 40% of the n-grams in the generated summary are also present in the reference summary.
  • ROUGE-n F1-score=40% is more difficult to interpret, like any F1-score.

ROUGE is more interpretable than BLEU (from {2}: "Other Known Deficiencies of Bleu: Scores hard to interpret"). I said "approximately" because the original ROUGE implementation from the paper that introduced ROUGE {3} may perform a few additional steps, such as stemming.
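For intuition, the three bullet points above map directly onto a few lines of Python. The sketch below is a simplified illustration (whitespace tokenization, single reference, no stemming or stopword removal), so it will not reproduce official ROUGE scores exactly; the function names are mine.

```python
from collections import Counter

def ngrams(tokens, n):
    """Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate: str, reference: str, n: int = 1):
    """Return (precision, recall, F1) for ROUGE-n with clipped n-gram counts."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)  # share of generated n-grams found in the reference
    recall = overlap / max(sum(ref.values()), 1)      # share of reference n-grams found in the generated summary
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = rouge_n("the cat sat on the mat", "the cat was sitting on the mat", n=2)
print(f"ROUGE-2 precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```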


References:
