The Kappa ($\kappa$) statistic is a quality index that compares the observed agreement between two raters on a nominal or ordinal scale with the agreement expected by chance alone (as if the raters were simply guessing at random). Extensions to the case of multiple raters exist (2, pp. 284–291). For ordinal data you can use the weighted $\kappa$, which is essentially the usual $\kappa$ with the off-diagonal cells also contributing to the measure of agreement. Fleiss (3) provided guidelines for interpreting $\kappa$ values, but these are merely rules of thumb.
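As a minimal sketch (with made-up counts), both the unweighted and the weighted $\kappa$ can be computed directly from the two raters' cross-classification table; irr::kappa2() on the raw ratings is a ready-made alternative.

```r
## Cohen's kappa from a k x k cross-classification of two raters (counts are invented).
tab <- matrix(c(20,  5,  3,
                 4, 15,  6,
                 1,  4, 12), nrow = 3, byrow = TRUE)

kappa_stat <- function(tab, weights = NULL) {
  p <- tab / sum(tab)                                # joint proportions
  k <- nrow(p)
  if (is.null(weights)) weights <- diag(k)           # unweighted: only exact agreement counts
  e  <- outer(rowSums(p), colSums(p))                # expected proportions under independence
  po <- sum(weights * p)                             # observed (weighted) agreement
  pe <- sum(weights * e)                             # chance (weighted) agreement
  (po - pe) / (1 - pe)
}

kappa_stat(tab)                                      # unweighted kappa
w <- 1 - (abs(outer(1:3, 1:3, "-")) / 2)^2           # quadratic weights for 3 ordered categories
kappa_stat(tab, weights = w)                         # weighted kappa
```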
The $\kappa$ statistic is asymptotically equivalent to the ICC estimated from a two-way random-effects ANOVA, but the significance tests and standard errors that come from the usual ANOVA framework are no longer valid with binary data. It is better to use the bootstrap to get a confidence interval (CI). Fleiss (8) discussed the connection between weighted kappa and the intraclass correlation coefficient (ICC).
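Here is a sketch of such a bootstrap CI, resampling subjects with replacement; the ratings are simulated for illustration and irr::kappa2() is used for the point estimate (psy::ckappa() would work as well).

```r
## Nonparametric bootstrap CI for Cohen's kappa; 'ratings' is simulated for illustration.
library(boot)
library(irr)

set.seed(101)
ratings <- data.frame(r1 = sample(1:4, 50, replace = TRUE),
                      r2 = sample(1:4, 50, replace = TRUE))

kappa_boot <- function(data, idx) kappa2(data[idx, ])$value  # resample subjects, recompute kappa
b <- boot(ratings, kappa_boot, R = 1000)
boot.ci(b, type = "bca")                                     # bias-corrected and accelerated 95% CI
```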
It should be noted that some psychometricians are not very fond of $\kappa$, because it is affected by the prevalence of the object of measurement, much as predictive values are affected by the prevalence of the disease under consideration, and this can lead to paradoxical results.
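A small illustration of that prevalence effect, with invented counts: both tables below show 85% observed agreement, yet $\kappa$ drops sharply when one category dominates.

```r
## Two 2x2 tables with the same observed agreement (85%) but different prevalence.
kappa_2x2 <- function(tab) {
  p  <- tab / sum(tab)
  po <- sum(diag(p))
  pe <- sum(rowSums(p) * colSums(p))
  (po - pe) / (1 - pe)
}
balanced <- matrix(c(45,  5,
                     10, 40), nrow = 2, byrow = TRUE)   # roughly 50% prevalence
skewed   <- matrix(c(80,  5,
                     10,  5), nrow = 2, byrow = TRUE)   # roughly 85-90% prevalence
c(balanced = kappa_2x2(balanced), skewed = kappa_2x2(skewed))  # ~0.70 vs ~0.32
```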
Inter-rater reliability for $k$ raters can be estimated with Kendall's coefficient of concordance, $W$. When the number of items or units that are rated is $n > 7$, $k(n - 1)W \sim \chi^2(n - 1)$ (2, pp. 269–270). This asymptotic approximation is valid for moderate values of $n$ and $k$ (6), but with fewer than 20 items $F$ or permutation tests are more suitable (7). There is a close relationship between Spearman's $\rho$ and Kendall's $W$ statistic: $W$ can be calculated directly from the mean of the pairwise Spearman correlations (for untied observations only).
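A sketch, with simulated untied ranks, of $W$ computed by hand, its $\chi^2$ approximation, and its link to the mean pairwise Spearman correlation; irr::kendall() wraps the same computation and test.

```r
## Kendall's W for k raters ranking n items (no ties), simulated data.
set.seed(42)
n <- 10; k <- 4
ranks <- replicate(k, sample(1:n))                 # each column: one rater's ranking of the n items

R_i <- rowSums(ranks)                              # rank sum for each item
W   <- 12 * sum((R_i - mean(R_i))^2) / (k^2 * (n^3 - n))

## Asymptotic test: k (n - 1) W ~ chi-square with n - 1 df
pchisq(k * (n - 1) * W, df = n - 1, lower.tail = FALSE)

## Link with Spearman: W = ((k - 1) * mean(rho) + 1) / k for untied observations
rho <- cor(ranks, method = "spearman")
((k - 1) * mean(rho[lower.tri(rho)]) + 1) / k      # equals W
```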
Polychoric correlation (for ordinal data) may also be used as a measure of inter-rater agreement. Indeed, it allows one to
- estimate what would be the correlation if ratings were made on a continuous scale,
- test marginal homogeneity between raters.
In fact, it can be shown that this is a special case of latent trait modeling, which makes it possible to relax the distributional assumptions (4).
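A minimal sketch of the first point, assuming simulated ordinal ratings derived from a common latent trait; polycor::polychor() is one implementation (psych::polychoric() is another).

```r
## Polychoric correlation between two raters' ordinal ratings, simulated from a latent trait.
library(polycor)

set.seed(7)
latent <- rnorm(200)
r1 <- cut(latent + rnorm(200, sd = 0.5), breaks = c(-Inf, -0.5, 0.5, Inf), labels = FALSE)
r2 <- cut(latent + rnorm(200, sd = 0.5), breaks = c(-Inf, -0.4, 0.6, Inf), labels = FALSE)

polychor(r1, r2, std.err = TRUE)   # estimated correlation on the underlying continuous scale
```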
For continuous (or assumed continuous) measurements, the ICC, which quantifies the proportion of variance attributable to the between-subject variation, is fine. Again, bootstrapped CIs are recommended. As @ars said, there are basically two versions -- agreement and consistency -- that are applicable in the case of agreement studies (5) and that mainly differ in the way the sums of squares are computed; the "consistency" ICC is generally estimated without considering the Item×Rater interaction. The ANOVA framework is useful with specific block designs where one wants to minimize the number of ratings (BIBD) -- in fact, this was one of the original motivations of Fleiss's work. It is also the best way to go for multiple raters. The natural extension of this approach is called Generalizability Theory. A brief overview is given in Rater Models: An Introduction; otherwise the standard reference is Brennan's book, reviewed in Psychometrika 2006, 71(3).
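As a sketch of the agreement/consistency distinction, assuming simulated continuous ratings with systematic rater offsets and the irr package (psych::ICC() reports the same family of coefficients):

```r
## Agreement vs consistency ICC on simulated ratings with systematic rater offsets.
library(irr)

set.seed(1)
truth   <- rnorm(30)                                    # 30 subjects
offsets <- c(0, 0.8, -0.5)                              # systematic differences between 3 raters
ratings <- sapply(offsets, function(b) truth + b + rnorm(30, sd = 0.3))

icc(ratings, model = "twoway", type = "agreement",   unit = "single")
icc(ratings, model = "twoway", type = "consistency", unit = "single")
## Agreement is penalised by the rater offsets; consistency is not.
```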
As for general references, I recommend chapter 3 of Statistics in Psychiatry by Graham Dunn (Hodder Arnold, 2000). For a more complete treatment of reliability studies, the best reference to date is Dunn, G (2004), Design and Analysis of Reliability Studies, Arnold; see the review in the International Journal of Epidemiology.
A good online introduction is available on John Uebersax's website, Intraclass Correlation and Related Methods; it includes a discussion of the pros and cons of the ICC approach, especially with respect to ordinal scales.
Relevant R packages for two-way assessment (ordinal or continuous measurements) are listed in the Psychometrics Task View; I generally use the psy, psych, or irr packages. There is also the concord package, but I have never used it. For dealing with more than two raters, the lme4 package is the way to go, since it makes it easy to incorporate random effects, but most reliability designs can be analysed with aov() because we only need to estimate variance components.
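A sketch of that variance-components route with lme4, using a simulated long-format data set (subject, rater, score; names are illustrative); the single-rating agreement ICC is the subject variance over the total.

```r
## Crossed random effects for subjects and raters; ICC(agreement) from the variance components.
library(lme4)

set.seed(2)
d <- expand.grid(subject = factor(1:20), rater = factor(1:4))
subj_eff  <- rnorm(20)
rater_eff <- rnorm(4, sd = 0.4)
d$score   <- subj_eff[as.integer(d$subject)] + rater_eff[as.integer(d$rater)] +
             rnorm(nrow(d), sd = 0.3)

m  <- lmer(score ~ 1 + (1 | subject) + (1 | rater), data = d)
vc <- as.data.frame(VarCorr(m))
v  <- setNames(vc$vcov, vc$grp)
unname(v["subject"] / sum(v))      # single-rating ICC(agreement)
```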
References
1. J Cohen. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220, 1968.
2. S Siegel and N J Castellan Jr. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, second edition, 1988.
3. J L Fleiss. Statistical Methods for Rates and Proportions. New York: Wiley, second edition, 1981.
4. J S Uebersax. The tetrachoric and polychoric correlation coefficients. Statistical Methods for Rater Agreement web site, 2006. Available at: http://john-uebersax.com/stat/tetra.htm. Accessed February 24, 2010.
5. P E Shrout and J L Fleiss. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428, 1979.
6. M G Kendall and B Babington Smith. The problem of m rankings. Annals of Mathematical Statistics, 10, 275–287, 1939.
7. P Legendre. Coefficient of concordance. In N J Salkind, editor, Encyclopedia of Research Design. SAGE Publications, 2010.
8. J L Fleiss. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613–619, 1973.
Best Answer
A measure that is low when highly skewed raters agree is actually highly desirable. Gwet's AC1 specifically assumes that chance agreement should be at most 50%, but if both raters vote +ve 90% of the time, Cohen and Fleiss/Scott say that chance agreement is 81% on the positives and 1% on the negatives, for a total of 82% expected accuracy.
This is precisely the kind of bias that needs to be eliminated. A contingency table of
          +ve   -ve
    +ve    81     9
    -ve     9     1
represents chance level performance. Fleiss and Cohen Kappa and Correlation are 0 but AC1 is a misleading 89%. We of course see the accuracy of 82% and also see Recall and Precision and F-measue of 90%, if we considered them in these terms...
Consider two raters, one of whom is a linguist who gives highly reliable part-of-speech ratings - noun versus verb, say - and the other of whom is, unbeknownst to anyone, a computer program so hopeless that it just guesses.
Since "water" is a noun 90% of the time, the linguist says noun 90% of the time and verb 10% of the time.
One form of guessing is to label words with their most frequent part of speech; another is to guess the different parts of speech with probabilities given by their frequencies. This latter "prevalence-biased" approach will be rated 0 by all Kappa and Correlation measures, as well as by DeltaP, DeltaP', Informedness and Markedness (the regression coefficients that each give prediction information in one direction, and whose geometric mean is the Matthews Correlation). It corresponds to the table above.
The "most frequent" part of speech random tagger gives the following table for 100 words:
          noun  verb
    noun    90    10
    verb     0     0
That is, it correctly predicts all 90 of the linguist's nouns, but none of the 10 verbs.
All Kappas and Correlations, and Informedness, give this 0, but AC1 gives it a misleading 81%.
Informedness gives the probability that the tagger is making an informed decision - that is, the proportion of the time it is making an informed decision - and it correctly returns 0 here (the tagger is never making an informed decision).
On the other hand, Markedness estimates what proportion of the time the linguist is correctly marking the word, and it underestimates this, at 40%. If we consider this in terms of the precision and recall of the program, we have a Precision of 90% (the 10% we get wrong are verbs), but since we only consider the nouns, we have a Recall of 100% (we get all of them, as the computer always guesses noun). But Inverse Recall is 0, and Inverse Precision is undefined, as the computer makes no -ve predictions (consider the inverse problem where verb is the +ve class, so the computer is now always predicting -ve, the more prevalent class).
In the dichotomous case (two classes) we have:
- Informedness = Recall + Inverse Recall - 1
- Markedness = Precision + Inverse Precision - 1
- Correlation = GeoMean(Informedness, Markedness)
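A small sketch of these definitions, applied to the two tables above (rows taken as the predictor/tagger, columns as the reference/linguist, positive class first):

```r
## Informedness, Markedness and their (signed) geometric mean from a 2x2 table.
rates <- function(tab) {
  tp <- tab[1, 1]; fp <- tab[1, 2]; fn <- tab[2, 1]; tn <- tab[2, 2]
  recall        <- tp / (tp + fn)
  inv_recall    <- tn / (tn + fp)
  precision     <- tp / (tp + fp)
  inv_precision <- tn / (tn + fn)
  inf  <- recall + inv_recall - 1
  mark <- precision + inv_precision - 1
  c(informedness = inf,
    markedness   = mark,
    correlation  = sign(inf) * sqrt(inf * mark))       # Matthews correlation
}
rates(matrix(c(81, 9, 9, 1), 2, 2, byrow = TRUE))      # chance table: all three are 0
rates(matrix(c(90, 10, 0, 0), 2, 2, byrow = TRUE))     # "most frequent" tagger: informedness 0, markedness undefined (NaN)
```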
Short answer - Correlation is best when there is nothing to choose between the raters, otherwise Informedness. If you want to use Kappa and think both raters should have the same distribution, use Fleiss; but normally you will want to allow them their own scales and use Cohen. I don't know of any example where AC1 would give a more appropriate answer, but in general the unintuitive results arise from mismatches between the biases/prevalences of the two raters' class choices. When bias = prevalence = 0.5, all of the measures agree; when the measures disagree, it is your assumptions that determine what is appropriate, and the guidelines I've given reflect the corresponding assumptions.
This Water example originated in...
Jim Entwisle and David M. W. Powers (1998), "The Present Use of Statistics in the Evaluation of NLP Parsers", pp. 215–224, NeMLaP3/CoNLL98 Joint Conference, Sydney, January 1998 - this should be cited for all Bookmaker theory/history purposes. http://david.wardpowers.info/Research/AI/papers/199801a-CoNLL-USE.pdf http://dl.dropbox.com/u/27743223/199801a-CoNLL-USE.pdf
Informedness and Markedness versus Kappa are explained in...
David M. W. Powers (2012). "The Problem with Kappa". Conference of the European Chapter of the Association for Computational Linguistics (EACL2012) Joint ROBUS-UNSUP Workshop. - cite for work using Informedness or Kappa in an NLP/CL context. http://aclweb.org/anthology-new/E/E12/E12-1035.pdf http://dl.dropbox.com/u/27743223/201209-eacl2012-Kappa.pdf