I recently read the paper on Salesforce's advances in abstractive text summarisation. It states that the achieved ROUGE-1 score of 41.16 is significantly better than the previous state of the art.
I also read this paper on (mainly extractive) text summarisation techniques. It compares ROUGE-1 scores of various text summarisation methods, which range between 0.3788 and 0.5190.
I assume that the scores cited above are just using different scales (one reporting percentages, the other fractions), but even so I am finding it hard to get a clear understanding of how ROUGE works. This SO question says that ROUGE measures recall, but that is contradicted by this post, which covers both precision and recall.
I can understand that a higher score shows improvement over previous scores. I can also understand that abstractive text summarisation is harder than extractive text summarisation. Presumably, as a researcher you would always be trying to get a better score than the previous techniques. But as a user of these methods I need to gauge how far I can rely on the algorithms and how far I need to use humans to do some post-processing on the summarisations.
So my question has two sides to it:
- What is the best way to really understand what a ROUGE score actually measures?
- How "good" is a particular absolute ROUGE score? I'm defining "good" as "minimises the need for human post-processing".
Best Answer
There are two aspects that may impact the need for human post-processing:
ROUGE doesn't try to assess how fluent the summary is: it only tries to assess adequacy, by simply counting how many n-grams in your generated summary match the n-grams in your reference summary (or summaries, as ROUGE supports multi-reference corpora).
From https://en.wikipedia.org/w/index.php?title=Automatic_summarization&oldid=808057887#Document_summarization:
Note that BLEU has the same issue, as you can see in these correlation plots, taken from {1}:
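To make the n-gram counting described above concrete, here is a minimal sketch of ROUGE-N for a single reference. This is not the official implementation (which supports multiple references and extra preprocessing options); it just shows how precision, recall and F1 fall out of clipped n-gram overlap counts:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Toy ROUGE-N: precision, recall and F1 from clipped n-gram overlap.

    Each reference n-gram can be matched at most as many times as it
    appears in the reference (Counter intersection handles the clipping).
    """
    def ngrams(text, n):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())           # clipped match count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# 5 of 6 unigrams overlap ("the" x2, "cat", "on", "mat")
p, r, f = rouge_n("the cat sat on the mat", "the cat was on the mat")
```

Note that the candidate "on the mat the cat sat" would get the same ROUGE-1 score despite being disfluent, which is exactly the limitation discussed above.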
In short and approximately:
ROUGE is more interpretable than BLEU (from {2}: "Other Known Deficiencies of Bleu: Scores hard to interpret"). I said approximately because the original ROUGE implementation from the paper that introduced ROUGE {3} may perform a few additional steps, such as stemming.
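To see why preprocessing like stemming changes the score, here is a small sketch. The crude suffix-stripping function below is purely illustrative (the actual implementation uses a proper stemmer, e.g. Porter-style); the point is only that stemming lets morphological variants count as matches:

```python
def naive_stem(token):
    """Crude illustrative stemmer -- NOT the one real ROUGE uses."""
    for suffix in ("ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            stem = token[: len(token) - len(suffix)]
            return stem + "y" if suffix == "ies" else stem
    return token

cand = "the summaries were short".split()
ref = "the summary was short".split()

# without stemming, "summaries" and "summary" do not match
raw_overlap = len(set(cand) & set(ref))
# with stemming, both reduce to "summary" and the overlap grows
stemmed_overlap = len({naive_stem(t) for t in cand} & {naive_stem(t) for t in ref})
```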
References: