The Kappa ($\kappa$) statistic is a quality index that compares the observed agreement between two raters on a nominal or ordinal scale with the agreement expected by chance alone (as if the raters were simply guessing at random). Extensions to the case of multiple raters exist (2, pp. 284–291). For ordinal data you can use the weighted $\kappa$, which is essentially the usual $\kappa$ with the off-diagonal cells also contributing to the measure of agreement. Fleiss (3) provided guidelines for interpreting $\kappa$ values, but these are merely rules of thumb.
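As a minimal sketch (with made-up counts), both the unweighted and the weighted $\kappa$ can be computed directly from the two raters' cross-classification table; irr::kappa2() on the raw ratings is a ready-made alternative.

```r
## Cohen's kappa from a k x k cross-classification of two raters (counts are invented).
tab <- matrix(c(20,  5,  3,
                 4, 15,  6,
                 1,  4, 12), nrow = 3, byrow = TRUE)

kappa_stat <- function(tab, weights = NULL) {
  p <- tab / sum(tab)                                # joint proportions
  k <- nrow(p)
  if (is.null(weights)) weights <- diag(k)           # unweighted: only exact agreement counts
  e  <- outer(rowSums(p), colSums(p))                # expected proportions under independence
  po <- sum(weights * p)                             # observed (weighted) agreement
  pe <- sum(weights * e)                             # chance (weighted) agreement
  (po - pe) / (1 - pe)
}

kappa_stat(tab)                                      # unweighted kappa
w <- 1 - (abs(outer(1:3, 1:3, "-")) / 2)^2           # quadratic weights for 3 ordered categories
kappa_stat(tab, weights = w)                         # weighted kappa
```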
The $\kappa$ statistic is asymptotically equivalent to the ICC estimated from a two-way random-effects ANOVA, but the significance tests and standard errors that come from the usual ANOVA framework are no longer valid with binary data. It is better to use the bootstrap to get a confidence interval (CI). Fleiss (8) discussed the connection between weighted kappa and the intraclass correlation coefficient (ICC).
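Here is a sketch of such a bootstrap CI, resampling subjects with replacement; the ratings are simulated for illustration and irr::kappa2() is used for the point estimate (psy::ckappa() would work as well).

```r
## Nonparametric bootstrap CI for Cohen's kappa; 'ratings' is simulated for illustration.
library(boot)
library(irr)

set.seed(101)
ratings <- data.frame(r1 = sample(1:4, 50, replace = TRUE),
                      r2 = sample(1:4, 50, replace = TRUE))

kappa_boot <- function(data, idx) kappa2(data[idx, ])$value  # resample subjects, recompute kappa
b <- boot(ratings, kappa_boot, R = 1000)
boot.ci(b, type = "bca")                                     # bias-corrected and accelerated 95% CI
```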
It should be noted that some psychometricians are not very fond of $\kappa$, because it is affected by the prevalence of the object of measurement, much as predictive values are affected by the prevalence of the disease under consideration, and this can lead to paradoxical results.
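A small illustration of that prevalence effect, with invented counts: both tables below show 85% observed agreement, yet $\kappa$ drops sharply when one category dominates.

```r
## Two 2x2 tables with the same observed agreement (85%) but different prevalence.
kappa_2x2 <- function(tab) {
  p  <- tab / sum(tab)
  po <- sum(diag(p))
  pe <- sum(rowSums(p) * colSums(p))
  (po - pe) / (1 - pe)
}
balanced <- matrix(c(45,  5,
                     10, 40), nrow = 2, byrow = TRUE)   # roughly 50% prevalence
skewed   <- matrix(c(80,  5,
                     10,  5), nrow = 2, byrow = TRUE)   # roughly 85-90% prevalence
c(balanced = kappa_2x2(balanced), skewed = kappa_2x2(skewed))  # ~0.70 vs ~0.32
```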
Inter-rater reliability for $k$ raters can be estimated with Kendall's coefficient of concordance, $W$. When the number of items or units that are rated is $n > 7$, $k(n - 1)W \sim \chi^2(n - 1)$ (2, pp. 269–270). This asymptotic approximation is valid for moderate values of $n$ and $k$ (6), but with fewer than 20 items $F$ or permutation tests are more suitable (7). There is a close relationship between Spearman's $\rho$ and Kendall's $W$ statistic: $W$ can be calculated directly from the mean of the pairwise Spearman correlations (for untied observations only).
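A sketch, with simulated untied ranks, of $W$ computed by hand, its $\chi^2$ approximation, and its link to the mean pairwise Spearman correlation; irr::kendall() wraps the same computation and test.

```r
## Kendall's W for k raters ranking n items (no ties), simulated data.
set.seed(42)
n <- 10; k <- 4
ranks <- replicate(k, sample(1:n))                 # each column: one rater's ranking of the n items

R_i <- rowSums(ranks)                              # rank sum for each item
W   <- 12 * sum((R_i - mean(R_i))^2) / (k^2 * (n^3 - n))

## Asymptotic test: k (n - 1) W ~ chi-square with n - 1 df
pchisq(k * (n - 1) * W, df = n - 1, lower.tail = FALSE)

## Link with Spearman: W = ((k - 1) * mean(rho) + 1) / k for untied observations
rho <- cor(ranks, method = "spearman")
((k - 1) * mean(rho[lower.tri(rho)]) + 1) / k      # equals W
```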
Polychoric correlation (for ordinal data) may also be used as a measure of inter-rater agreement. Indeed, it allows one to
- estimate what would be the correlation if ratings were made on a continuous scale,
- test marginal homogeneity between raters.
In fact, it can be shown that this is a special case of latent trait modeling, which makes it possible to relax the distributional assumptions (4).
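A minimal sketch of the first point, assuming simulated ordinal ratings derived from a common latent trait; polycor::polychor() is one implementation (psych::polychoric() is another).

```r
## Polychoric correlation between two raters' ordinal ratings, simulated from a latent trait.
library(polycor)

set.seed(7)
latent <- rnorm(200)
r1 <- cut(latent + rnorm(200, sd = 0.5), breaks = c(-Inf, -0.5, 0.5, Inf), labels = FALSE)
r2 <- cut(latent + rnorm(200, sd = 0.5), breaks = c(-Inf, -0.4, 0.6, Inf), labels = FALSE)

polychor(r1, r2, std.err = TRUE)   # estimated correlation on the underlying continuous scale
```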
For continuous (or assumed continuous) measurements, the ICC, which quantifies the proportion of variance attributable to the between-subject variation, is fine. Again, bootstrapped CIs are recommended. As @ars said, there are basically two versions -- agreement and consistency -- that are applicable in the case of agreement studies (5) and that mainly differ in the way the sums of squares are computed; the "consistency" ICC is generally estimated without considering the Item×Rater interaction. The ANOVA framework is useful with specific block designs where one wants to minimize the number of ratings (BIBD) -- in fact, this was one of the original motivations of Fleiss's work. It is also the best way to go for multiple raters. The natural extension of this approach is called Generalizability Theory. A brief overview is given in Rater Models: An Introduction; otherwise the standard reference is Brennan's book, reviewed in Psychometrika 2006, 71(3).
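As a sketch of the agreement/consistency distinction, assuming simulated continuous ratings with systematic rater offsets and the irr package (psych::ICC() reports the same family of coefficients):

```r
## Agreement vs consistency ICC on simulated ratings with systematic rater offsets.
library(irr)

set.seed(1)
truth   <- rnorm(30)                                    # 30 subjects
offsets <- c(0, 0.8, -0.5)                              # systematic differences between 3 raters
ratings <- sapply(offsets, function(b) truth + b + rnorm(30, sd = 0.3))

icc(ratings, model = "twoway", type = "agreement",   unit = "single")
icc(ratings, model = "twoway", type = "consistency", unit = "single")
## Agreement is penalised by the rater offsets; consistency is not.
```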
As for general references, I recommend chapter 3 of Statistics in Psychiatry by Graham Dunn (Hodder Arnold, 2000). For a more complete treatment of reliability studies, the best reference to date is Dunn, G (2004), Design and Analysis of Reliability Studies, Arnold; see the review in the International Journal of Epidemiology.
A good online introduction is available on John Uebersax's website, Intraclass Correlation and Related Methods; it includes a discussion of the pros and cons of the ICC approach, especially with respect to ordinal scales.
Relevant R packages for two-way assessment (ordinal or continuous measurements) are listed in the Psychometrics Task View; I generally use the psy, psych, or irr packages. There is also the concord package, but I have never used it. For dealing with more than two raters, the lme4 package is the way to go, since it makes it easy to incorporate random effects, but most reliability designs can be analysed with aov() because we only need to estimate variance components.
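A sketch of that variance-components route with lme4, using a simulated long-format data set (subject, rater, score; names are illustrative); the single-rating agreement ICC is the subject variance over the total.

```r
## Crossed random effects for subjects and raters; ICC(agreement) from the variance components.
library(lme4)

set.seed(2)
d <- expand.grid(subject = factor(1:20), rater = factor(1:4))
subj_eff  <- rnorm(20)
rater_eff <- rnorm(4, sd = 0.4)
d$score   <- subj_eff[as.integer(d$subject)] + rater_eff[as.integer(d$rater)] +
             rnorm(nrow(d), sd = 0.3)

m  <- lmer(score ~ 1 + (1 | subject) + (1 | rater), data = d)
vc <- as.data.frame(VarCorr(m))
v  <- setNames(vc$vcov, vc$grp)
unname(v["subject"] / sum(v))      # single-rating ICC(agreement)
```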
References
1. J Cohen. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220, 1968.
2. S Siegel and N J Castellan Jr. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, second edition, 1988.
3. J L Fleiss. Statistical Methods for Rates and Proportions. New York: Wiley, second edition, 1981.
4. J S Uebersax. The tetrachoric and polychoric correlation coefficients. Statistical Methods for Rater Agreement web site, 2006. Available at: http://john-uebersax.com/stat/tetra.htm. Accessed February 24, 2010.
5. P E Shrout and J L Fleiss. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428, 1979.
6. M G Kendall and B Babington Smith. The problem of m rankings. Annals of Mathematical Statistics, 10, 275–287, 1939.
7. P Legendre. Coefficient of concordance. In N J Salkind, editor, Encyclopedia of Research Design. SAGE Publications, 2010.
8. J L Fleiss. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613–619, 1973.
Best Answer
A measure that is low when highly skewed raters agree is actually highly desirable. Gwet's AC1 specifically assumes that chance agreement should be at most 50%, but if both raters vote +ve 90% of the time, Cohen and Fleiss/Scott say that chance agreement is 81% on the positives and 1% on the negatives, for a total of 82% expected accuracy.
This is precisely the kind of bias that needs to be eliminated. A contingency table of
          +ve   -ve
    +ve    81     9
    -ve     9     1
represents chance level performance. Fleiss and Cohen Kappa and Correlation are 0 but AC1 is a misleading 89%. We of course see the accuracy of 82% and also see Recall and Precision and F-measue of 90%, if we considered them in these terms...
Consider two raters, one of whom is a linguist who gives highly reliable part-of-speech ratings - noun versus verb, say - and the other of whom is, unbeknownst to anyone, a computer program so hopeless that it just guesses.
Since "water" is a noun 90% of the time, the linguist says noun 90% of the time and verb 10% of the time.
One form of guessing is to label words with their most frequent part of speech; another is to guess the different parts of speech with probabilities given by their frequencies. This latter "prevalence-biased" approach will be rated 0 by all Kappa and Correlation measures, as well as by DeltaP, DeltaP', Informedness and Markedness (the regression coefficients that each give prediction information in one direction, and whose geometric mean is the Matthews Correlation). It corresponds to the table above.
The "most frequent" part of speech random tagger gives the following table for 100 words:
          noun  verb
    noun    90    10
    verb     0     0
That is, it correctly predicts all 90 of the linguist's nouns, but none of the 10 verbs.
All Kappas and Correlations, and Informedness, give this 0, but AC1 gives it a misleading 81%.
Informedness gives the probability that the tagger is making an informed decision - that is, the proportion of the time it is making an informed decision - and it correctly returns 0 here (the tagger is never making an informed decision).
On the other hand, Markedness estimates what proportion of the time the linguist is correctly marking the word, and it underestimates this, at 40%. If we consider this in terms of the precision and recall of the program, we have a Precision of 90% (the 10% we get wrong are verbs), but since we only consider the nouns, we have a Recall of 100% (we get all of them, as the computer always guesses noun). But Inverse Recall is 0, and Inverse Precision is undefined, as the computer makes no -ve predictions (consider the inverse problem where verb is the +ve class, so the computer is now always predicting -ve, the more prevalent class).
In the dichotomous case (two classes) we have:
- Informedness = Recall + Inverse Recall - 1
- Markedness = Precision + Inverse Precision - 1
- Correlation = GeoMean(Informedness, Markedness)
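A small sketch of these definitions, applied to the two tables above (rows taken as the predictor/tagger, columns as the reference/linguist, positive class first):

```r
## Informedness, Markedness and their (signed) geometric mean from a 2x2 table.
rates <- function(tab) {
  tp <- tab[1, 1]; fp <- tab[1, 2]; fn <- tab[2, 1]; tn <- tab[2, 2]
  recall        <- tp / (tp + fn)
  inv_recall    <- tn / (tn + fp)
  precision     <- tp / (tp + fp)
  inv_precision <- tn / (tn + fn)
  inf  <- recall + inv_recall - 1
  mark <- precision + inv_precision - 1
  c(informedness = inf,
    markedness   = mark,
    correlation  = sign(inf) * sqrt(inf * mark))       # Matthews correlation
}
rates(matrix(c(81, 9, 9, 1), 2, 2, byrow = TRUE))      # chance table: all three are 0
rates(matrix(c(90, 10, 0, 0), 2, 2, byrow = TRUE))     # "most frequent" tagger: informedness 0, markedness undefined (NaN)
```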
Short answer - Correlation is best when there is nothing to choose between the raters, otherwise Informedness. If you want to use Kappa and think both raters should have the same distribution, use Fleiss; but normally you will want to allow them their own scales and use Cohen. I don't know of any example where AC1 would give a more appropriate answer, but in general the unintuitive results arise from mismatches between the biases/prevalences of the two raters' class choices. When bias = prevalence = 0.5, all of the measures agree; when the measures disagree, it is your assumptions that determine what is appropriate, and the guidelines I've given reflect the corresponding assumptions.
This Water example originated in...
Jim Entwisle and David M. W. Powers (1998), "The Present Use of Statistics in the Evaluation of NLP Parsers", pp. 215–224, NeMLaP3/CoNLL98 Joint Conference, Sydney, January 1998 - this should be cited for all Bookmaker theory/history purposes. http://david.wardpowers.info/Research/AI/papers/199801a-CoNLL-USE.pdf http://dl.dropbox.com/u/27743223/199801a-CoNLL-USE.pdf
Informedness and Markedness versus Kappa are explained in...
David M. W. Powers (2012). "The Problem with Kappa". Conference of the European Chapter of the Association for Computational Linguistics (EACL2012) Joint ROBUS-UNSUP Workshop. - cite for work using Informedness or Kappa in an NLP/CL context. http://aclweb.org/anthology-new/E/E12/E12-1035.pdf http://dl.dropbox.com/u/27743223/201209-eacl2012-Kappa.pdf