Solved – nltk multi_kappa (Davies and Fleiss) or alpha (Krippendorff)

agreement-statistics, cohens-kappa, nltk, python

I'm using inter-rater agreement to evaluate the agreement in my rating dataset. I have a set of N examples distributed among M raters. Not every rater voted on every item, so N x M votes is an upper bound.
So let's say rater i gives the following votes to the N items, for N=5 and M=3, where position j of the array holds the vote for the j-th item:

rater[1] = [1,3,0,5,5]
rater[2] = [0,3,1,5,2]
rater[3] = [1,2,0,5,3]

where 0 means that the rater did not express any opinion about the item in position j.
Now, I cannot use Cohen's kappa, since it is defined for exactly two raters, so I'm thinking of using NLTK's Krippendorff's alpha or multi_kappa.

In my dataset

  • Votes can be sparse, i.e. there can be items with only a few votes; in the worst case an item's votes look like

    rater[i] = [0, 0, ...,j, ..., 0]
    

so item j could have just one vote, from rater i, in the whole dataset.

  • Each item has at least one vote, so no item's vote vector is all zeros.
  • The number of raters M is less than the number of items N (M < N).

Which is the best approach, given the implementations in the NLTK metrics package?
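
For reference, here is a sketch of how I would feed this dataset to NLTK's AnnotationTask (the rater/item names are just illustrative). It takes (coder, item, label) triples, so a missing vote is represented simply by omitting the triple:

from nltk.metrics.agreement import AnnotationTask

# Original vote arrays; 0 means "no vote".
raters = {1: [1, 3, 0, 5, 5],
          2: [0, 3, 1, 5, 2],
          3: [1, 2, 0, 5, 3]}

# Build (coder, item, label) triples, dropping the 0 = "no vote" entries.
triples = [(str(r), str(j), vote)
           for r, votes in raters.items()
           for j, vote in enumerate(votes, start=1)
           if vote != 0]

task = AnnotationTask(data=triples)
print(task.alpha())  # Krippendorff's alpha tolerates the missing votes

As far as I can tell, alpha() simply skips items with fewer than two votes, while the kappa-style metrics (including multi_kappa()) compare coder pairs on every item and so expect complete data. Also note that the default distance is binary_distance (nominal); for graded 1-5 votes, passing distance=nltk.metrics.distance.interval_distance might be more appropriate.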

Best Answer

I found a solution that was useful in my case, so you may want to check it out. Here is a possible approach for your dataset.

import numpy as np
import krippendorff

# Mark missing votes as np.nan: the krippendorff package treats nan as
# "no rating", whereas a 0 would be counted as a real rating value.
data = [[1, 3, np.nan, 5, 5],
        [np.nan, 3, 1, 5, 2],
        [1, 2, np.nan, 5, 3]]

alpha = krippendorff.alpha(data)
print(alpha)

It works with Python 3.4+. Don't forget to install the dependencies:

pip install numpy krippendorff
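
One more note: krippendorff.alpha also takes a level_of_measurement argument that controls how disagreements between ratings are weighted. Since the votes here are ordered 1-5 ratings, it may be worth comparing levels; a minimal sketch, again using np.nan for the missing votes:

import numpy as np
import krippendorff

data = [[1, 3, np.nan, 5, 5],
        [np.nan, 3, 1, 5, 2],
        [1, 2, np.nan, 5, 3]]

# Alpha changes with the assumed scale of the ratings.
for level in ("nominal", "ordinal", "interval"):
    print(level, krippendorff.alpha(reliability_data=data,
                                    level_of_measurement=level))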