Solved – ROUGE scores for extractive vs abstractive text summarization

natural language, neural networks, text mining, text-summarization

The ROUGE scores allow us to measure (although not in a perfect way) the quality of our text summarization by counting the overlapping n-grams between the summary we produce and the reference one (or ones, usually created by humans).
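For concreteness, here is a minimal sketch of ROUGE-N recall in Python. It is deliberately simplified (no stemming, stopword handling, or multiple references, unlike the official toolkit), and the example sentences are made up:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of n-grams from a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Simplified ROUGE-N recall: overlapping n-grams / n-grams in the reference."""
    cand_counts = ngrams(candidate.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    overlap = sum(min(cand_counts[g], ref_counts[g]) for g in ref_counts)
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 reference unigrams matched
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 reference bigrams matched
```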

I'm a bit confused about how to use this metric in an extractive vs abstractive summarization setting, in particular how ROUGE depends on the choice of words in the reference summary.

Imagine that the human who created the reference summary chose to build it using exclusively (or mostly) words from the original text. Now we build a good extractive model that extracts words from the text. Its n-grams match the reference summary pretty well, hence we get high ROUGE scores.

Now we build an abstractive model on the same text. The model works pretty well: from a human perspective it creates a fluent, nice summary, but it uses synonyms or, in general, words not present in the text and therefore not in the reference summary, hence it produces a low ROUGE score despite being a good summary.

Now flip the setting: from the same text the human creates a reference summary that is mostly not composed of words from the text; it is more fluent and more abstract than in the previous case. Our extractive model now gets low ROUGE scores because it simply reuses words from the text, so there are few matches. On the other hand, the abstractive model may or may not get high ROUGE scores, depending on whether it happens to produce the same words present in the reference summary.

From these points it looks like the choice made by the human who created the reference summary influences our choice of using an extractive rather than an abstractive model. If the reference summary was created using words from the text, the extractive model will give us better ROUGE scores, and vice versa for the abstractive model.
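To make this concrete, here is a toy sketch (the sentences are made up for illustration) that scores an extractive candidate and a synonym-based paraphrase against a reference whose words were copied from the source text; the paraphrase scores poorly even though a human would likely accept it:

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """Unigram ROUGE recall: fraction of reference words covered by the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

# Original text (made up for illustration).
source = "the company reported a sharp drop in quarterly profits"

# Reference written by copying words from the source (extractive-style reference).
ref_extractive = "the company reported a sharp drop in profits"

# Candidate summaries: one copies source words, one paraphrases with synonyms.
extractive_summary = "the company reported a sharp drop in quarterly profits"
abstractive_summary = "the firm announced a steep decline in earnings"

print(rouge_1_recall(extractive_summary, ref_extractive))   # 1.0: every reference word matches
print(rouge_1_recall(abstractive_summary, ref_extractive))  # 0.375: synonyms do not count
```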

Does this make sense? I feel like I'm missing something. This dependence on the human's choice of how to write the reference summary seems far too influential, not only on our choice between extractive and abstractive methods, but also on their measured performance and, in general, on whether ROUGE scores are actually good metrics for our models.

Best Answer

Your concerns about the ROUGE score are indeed correct. The hope is that if the test data are large and diverse enough, the ROUGE score will on average give a reasonable number. There are several papers that study this problem in detail:

Correlation between ROUGE and Human Evaluation of Extractive Meeting Summaries from 2008 shows that there is only a very mild correlation between human opinions on summary quality and the ROUGE score when only human-written summaries are used.

The correlation with human judgment is much better on machine-generated summaries, as shown in the paper Better Summarization Evaluation with Word Embeddings for ROUGE from 2015.
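As a rough illustration of the direction that paper takes (soft matching via word-embedding similarity instead of exact lexical overlap), here is a toy sketch with made-up two-dimensional vectors; the actual method uses pretrained word2vec embeddings and the standard ROUGE machinery, so treat this only as a sketch of the idea:

```python
import math

# Toy, hand-made "embeddings"; in practice these would come from a pretrained
# model such as word2vec. This is an illustration of soft matching, not a
# reimplementation of the ROUGE-WE paper.
vectors = {
    "drop":     [0.90, 0.10],
    "decline":  [0.85, 0.20],
    "profits":  [0.10, 0.90],
    "earnings": [0.15, 0.85],
    "sharp":    [0.70, 0.50],
    "steep":    [0.68, 0.55],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def soft_unigram_recall(candidate, reference):
    """For each reference word, credit the best cosine similarity to any candidate word."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return sum(max(cosine(vectors[r], vectors[c]) for c in cand) for r in ref) / len(ref)

# Exact unigram matching would give 0 here; soft matching scores close to 1.
print(soft_unigram_recall("steep decline earnings", "sharp drop profits"))
```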

For comparison, the Spearman correlation of machine translation metrics with human judgment is typically over 0.95 for English.
