Solved – Interpreting ROUGE scores

rouge, text-summarization

I recently read the paper on Salesforce's advances in abstractive text summarisation, which states that its ROUGE-1 score of 41.16 is significantly better than the previous state of the art.

I also read this paper on (mainly extractive) text summarisation techniques. It compares the ROUGE-1 scores of various summarisation methods, which range from 0.3788 to 0.5190.

I assume that the scores cited above are just using different scales, but even so I am finding it hard to get a clear understanding of how ROUGE works. This SO question says that ROUGE measures recall, but that is contradicted by this post that covers both precision and recall.

I can understand that a higher score shows improvement over previous scores. I can also understand that abstractive text summarisation is harder than extractive text summarisation. Presumably, as a researcher you would always be trying to get a better score than the previous techniques. But as a user of these methods I need to gauge how far I can rely on the algorithms and how far I need to use humans to do some post-processing on the summarisations.

So my question has two sides to it:

  1. What is the best way to really understand what a ROUGE score actually measures?
  2. How "good" is a particular absolute ROUGE score? I'm defining "good" as "minimises the need for human post-processing".

Best Answer

As a user of these methods I need to gauge how far I can rely on the algorithms and how far I need to use humans to do some post-processing on the summarisations.

How "good" is a particular absolute ROUGE score? I'm defining "good" as "minimises the need for human post-processing".

There are two aspects that may impact the need for human post-processing:

  • Does the summary sound fluent?
  • Is the summary adequate? I.e., is the length appropriate, and does it cover the most important information of the text it summarizes?

ROUGE doesn't try to assess how fluent the summary is: it only tries to assess adequacy, by simply counting how many n-grams in your generated summary match the n-grams in your reference summary (or summaries, as ROUGE supports multi-reference corpora).

From https://en.wikipedia.org/w/index.php?title=Automatic_summarization&oldid=808057887#Document_summarization:

If there are multiple references, the ROUGE-1 scores are averaged. Because ROUGE is based only on content overlap, it can determine if the same general concepts are discussed between an automatic summary and a reference summary, but it cannot determine if the result is coherent or the sentences flow together in a sensible manner. High-order n-gram ROUGE measures try to judge fluency to some degree. Note that ROUGE is similar to the BLEU measure for machine translation, but BLEU is precision-based, because translation systems favor accuracy.

Note that BLEU has the same issue, as you can see on these correlation plots, taken from {1}:

[Figure: correlation plots from {1}]
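To make the n-gram counting and the averaging over multiple references concrete, here is a minimal Python sketch of ROUGE-1 recall. It is my own illustration, assuming bare whitespace tokenization and clipped unigram counts; the function names are hypothetical and the numbers will not exactly match the official ROUGE toolkit.

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Clipped unigram overlap divided by the number of unigrams in the reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((cand_counts & ref_counts).values())  # min count per shared unigram
    return overlap / max(sum(ref_counts.values()), 1)

def rouge1_recall_multi(candidate: str, references: list[str]) -> float:
    """Average the per-reference scores, as described in the excerpt above."""
    return sum(rouge1_recall(candidate, r) for r in references) / len(references)

print(rouge1_recall_multi(
    "the cat sat on the mat",
    ["the cat was on the mat", "a cat sat on a mat"],
))  # ~0.75: the candidate recovers most, but not all, reference unigrams
```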

What is the best way to really understand what a ROUGE score actually measures?

In short and approximately:

  • ROUGE-n recall=40% means that 40% of the n-grams in the reference summary are also present in the generated summary.
  • ROUGE-n precision=40% means that 40% of the n-grams in the generated summary are also present in the reference summary.
  • ROUGE-n F1-score=40% is more difficult to interpret, like any F1-score.

ROUGE is more interpretable than BLEU (from {2}: "Other Known Deficiencies of Bleu: Scores hard to interpret"). I said "approximately" because the original ROUGE implementation from the paper that introduced ROUGE {3} may perform a few additional steps, such as stemming.
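For intuition, the three bullet points above map directly onto a few lines of Python. The sketch below is a simplified illustration (whitespace tokenization, single reference, no stemming or stopword removal), so it will not reproduce official ROUGE scores exactly; the function names are mine.

```python
from collections import Counter

def ngrams(tokens, n):
    """Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate: str, reference: str, n: int = 1):
    """Return (precision, recall, F1) for ROUGE-n with clipped n-gram counts."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)  # share of generated n-grams found in the reference
    recall = overlap / max(sum(ref.values()), 1)      # share of reference n-grams found in the generated summary
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = rouge_n("the cat sat on the mat", "the cat was sitting on the mat", n=2)
print(f"ROUGE-2 precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```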


References:
