A measure that stays low when two highly skewed raters agree largely by chance is actually highly desirable. Gwet's AC1 specifically assumes that chance agreement should be at most 50%, but if both raters vote +ve 90% of the time, Cohen and Fleiss/Scott say that chance agreement is 81% on the positives and 1% on the negatives, for a total of 82% expected accuracy.
This is precisely the kind of bias that needs to be eliminated. The contingency table

    81  9
     9  1

represents chance-level performance. Fleiss' and Cohen's Kappa and the Correlation are all 0, but AC1 is a misleading 89%. Viewed as a classification task, we of course see the accuracy of 82%, and also a Recall, Precision and F-measure of 90%, if we consider it in those terms.
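The chance-level table above is easy to check numerically. This small R sketch (the cell values are taken from the example; the variable names are mine) computes the observed accuracy and Cohen's kappa:

```r
# Contingency table from the example: both raters vote +ve 90% of the time,
# independently of each other (rows = rater A, columns = rater B).
M <- matrix(c(81, 9,
               9, 1), nrow = 2, byrow = TRUE)

N        <- sum(M)
accuracy <- sum(diag(M)) / N                    # observed agreement p_o = 0.82
p_e      <- sum(rowSums(M) * colSums(M)) / N^2  # chance agreement, also 0.82
kappa    <- (accuracy - p_e) / (1 - p_e)        # 0: agreement is exactly at chance
```

Since the observed agreement equals the chance agreement, kappa comes out exactly 0 despite the 82% accuracy.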
Consider two raters, one of whom is a linguist who gives highly reliable part-of-speech ratings (noun versus verb, say), and the other of whom is, unbeknownst to anyone, a computer program so hopeless that it just guesses.
Since water is a noun 90% of the time, the linguist says noun 90% of the time and verb 10% of the time.
One form of guessing is to label every word with its most frequent part of speech; another is to guess each part of speech with probability given by its frequency. This latter "prevalence-biased" approach will be rated 0 by all Kappa and Correlation measures, as well as by DeltaP, DeltaP', Informedness and Markedness (the regression coefficients that give one-directional prediction information, and whose geometric mean is the Matthews Correlation). It corresponds to the table above.
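The prevalence-biased guesser is also easy to simulate. In this sketch (my own, not from the original paper; the seed is arbitrary), two independent raters draw from the same 90/10 distribution, and kappa stays near zero:

```r
set.seed(1)  # reproducibility; any seed gives a similar result

n        <- 10000
# The linguist says "noun" 90% of the time; the prevalence-biased guesser
# draws independently from the same distribution.
linguist <- sample(c("noun", "verb"), n, replace = TRUE, prob = c(0.9, 0.1))
guesser  <- sample(c("noun", "verb"), n, replace = TRUE, prob = c(0.9, 0.1))

M   <- table(linguist, guesser)
p_o <- sum(diag(M)) / n                    # observed agreement, around 0.82
p_e <- sum(rowSums(M) * colSums(M)) / n^2  # chance agreement, also around 0.82
(p_o - p_e) / (1 - p_e)                    # Cohen's kappa: close to 0
```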
The "most frequent part of speech" tagger gives the following table for 100 words (rows = tagger, columns = linguist):

    90 10
     0  0

That is, it predicts correctly all 90 of the linguist's nouns, but none of the 10 verbs.
All Kappas and Correlations, and Informedness, give this 0, but AC1 gives it a misleading 81%.
Informedness gives the probability that the tagger is making an informed decision, that is, what proportion of the time it is informed rather than guessing, and correctly returns 0.
On the other hand, Markedness estimates what proportion of the time the linguist is correctly marking the word, and it underestimates 40%. If we consider this in terms of the precision and recall of the program, we have a Precision of 90% (we get wrong the 10% that are verbs), but, since we only consider the nouns, a Recall of 100% (the computer always guesses noun, so it gets all of them). Inverse Recall, however, is 0, and Inverse Precision is undefined, as the computer makes no -ve predictions (consider the inverse problem where verb is the +ve class, so the computer is now always predicting -ve, the more prevalent class).
In the dichotomous case (two classes) we have:
Informedness = Recall + Inverse Recall - 1.
Markedness = Precision + Inverse Precision - 1.
Correlation = GeoMean (Informedness, Markedness).
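These identities can be checked directly on the two tagger tables above. A minimal R sketch, assuming the convention that rows are the tagger's predictions, columns the linguist's labels, and cell [1,1] means both chose the +ve class (noun):

```r
# Informedness = Recall + Inverse Recall - 1 (columns hold the "gold" labels)
informedness <- function(M)
  M[1, 1] / sum(M[, 1]) + M[2, 2] / sum(M[, 2]) - 1
# Markedness = Precision + Inverse Precision - 1 (rows hold the predictions)
markedness <- function(M)
  M[1, 1] / sum(M[1, ]) + M[2, 2] / sum(M[2, ]) - 1
# Matthews Correlation as the signed geometric mean of the two
correlation <- function(M)
  sign(informedness(M)) * sqrt(informedness(M) * markedness(M))

chance   <- matrix(c(81, 9, 9, 1), 2, byrow = TRUE)   # prevalence-biased guesser
mostfreq <- matrix(c(90, 10, 0, 0), 2, byrow = TRUE)  # always predicts "noun"

informedness(chance)    # 0
markedness(chance)      # 0
informedness(mostfreq)  # 0
markedness(mostfreq)    # NaN: Inverse Precision undefined (no -ve predictions)
```

Both degenerate taggers get an Informedness of 0, and the "most frequent" tagger's undefined Inverse Precision shows up as NaN, matching the discussion above.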
Short answer: Correlation is best when there is nothing to choose between the raters; otherwise Informedness. If you want to use Kappa and think both raters should have the same distribution, use Fleiss; but normally you will want to allow them their own scales, and use Cohen. I don't know of any example where AC1 would give a more appropriate answer; in general the unintuitive results come from mismatches between the biases/prevalences of the two raters' class choices. When bias = prevalence = 0.5, all of the measures agree; when the measures disagree, it is your assumptions that determine what is appropriate, and the guidelines I've given reflect the corresponding assumptions.
This "water" example originated in:
Jim Entwisle and David M. W. Powers (1998). "The Present Use of Statistics in the Evaluation of NLP Parsers", pp. 215–224, NeMLaP3/CoNLL98 Joint Conference, Sydney, January 1998. Cite this for all Bookmaker theory/history purposes.
http://david.wardpowers.info/Research/AI/papers/199801a-CoNLL-USE.pdf
http://dl.dropbox.com/u/27743223/199801a-CoNLL-USE.pdf
Informedness and Markedness versus Kappa are explained in...
David M. W. Powers (2012). "The Problem with Kappa". Conference of the European Chapter of the Association for Computational Linguistics (EACL2012) Joint ROBUS-UNSUP Workshop. Cite this for work using Informedness or Kappa in an NLP/CL context.
http://aclweb.org/anthology-new/E/E12/E12-1035.pdf
http://dl.dropbox.com/u/27743223/201209-eacl2012-Kappa.pdf
You calculate a single Kappa for all the categories at once. The formula for $\kappa$ is:
$$\kappa = \frac{p_{o}-p_{e}}{1-p_{e}} = \frac{N_{o}-N_{e}}{N-N_{e}}$$
You see, this depends only on $p_{o}$ and $p_{e}$, respectively the observed agreement and the chance agreement. The number of categories does not matter at all.
Say you have a table of observations $m_{i,j}$ with $k$ categories.
$\sum_i^k{m_{i,i}}$ is the sum of agreements in the table.
$\sum_j^k{m_{l,j}}$ is the sum of counts in the $l$-th row.
$\sum_i^k{m_{i,l}}$ is the sum of counts in the $l$-th column.
So, when you multiply the total count in a row by the total count in the corresponding column and divide by $N$, you have a conservative estimate of the count of chance agreements in that category. Summing this over all categories gives an overall estimate of chance agreement, which you can plug into the $\kappa$ formula.
$$N_o = \sum_i^k{m_{i,i}}$$
$$N_e = \frac{1}{N}\cdot\sum_l^k\left(\sum_j^k{m_{l,j}} \cdot \sum_i^k{m_{i,l}}\right)$$
$$\therefore \kappa = \frac{N_{o}-N_{e}}{N-N_{e}} = \frac{\sum_i^k{m_{i,i}}-\frac{1}{N}\cdot\sum_l^k\left(\sum_j^k{m_{l,j}} \cdot \sum_i^k{m_{i,l}}\right)}{N-\frac{1}{N}\cdot\sum_l^k\left(\sum_j^k{m_{l,j}} \cdot \sum_i^k{m_{i,l}}\right)}$$
And here's a sample implementation in R (note that this masks base R's `kappa()`, which computes a matrix condition number):

kappa <- function(M) {
  N  <- sum(M)                             # total number of observations
  No <- sum(diag(M))                       # observed agreement count
  Ne <- sum(rowSums(M) * colSums(M)) / N   # chance agreement count
  (No - Ne) / (N - Ne)
}
In case of perfect agreement:
> M = table(iris$Species, iris$Species)
> print(kappa(M))
[1] 1
In case of random predictions (the exact value varies with the sample):
> M = table(sample(iris$Species, 150L), iris$Species)
> print(kappa(M))
[1] -0.09
You could use a chance-adjusted agreement index (e.g., Cohen's kappa or Scott's pi) for each category separately. Alternatively, you could use the following approach:
Kraemer (1980) proposed a method for assessing inter-rater reliability for tasks in which raters could select multiple categories for each object of measurement. The intuition behind this method is to reframe the problem from one of classification to one of rank ordering. Thus, all selected categories are tied for first place and all non-selected categories are tied for second place. Chance-adjusted agreement can then be calculated using rank correlation coefficients or analysis of variance of the ranks. Naturally, this approach also allows multiple categories to be ranked by raters.
$$ \kappa_0 = \frac{\bar{P} - P_e}{1 - P_e} + \frac{1 - \bar{P}}{Nm_0(1 - P_e)} $$ where $\bar{P}$ is the average proportion of concordant pairs out of all possible pairs of observations for each subject, $P_e=\sum_j p_j^2$ and $p_j$ is the overall proportion of observations in which response category $j$ was selected, $m_0$ is the number of observations per subject, and $N$ is the number of subjects. It can also be shown that, when only one category is selected, $\kappa_0$ asymptotically approaches Cohen's and Fleiss' kappa coefficients.
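A direct transcription of the formula in R, assuming you have already computed the summary quantities $\bar{P}$, $P_e$, $N$ and $m_0$ from your data (the function name and argument names are mine):

```r
# Kraemer's extended kappa, transcribed from the formula above.
# Pbar: average proportion of concordant pairs per subject
# Pe:   sum of squared category proportions (chance agreement)
# N:    number of subjects; m0: observations per subject
kappa0 <- function(Pbar, Pe, N, m0)
  (Pbar - Pe) / (1 - Pe) + (1 - Pbar) / (N * m0 * (1 - Pe))

kappa0(Pbar = 1, Pe = 0.5, N = 20, m0 = 2)  # perfect agreement gives exactly 1
```

Note the small-sample correction term vanishes as $\bar{P} \to 1$ or as $N m_0$ grows, leaving the familiar $(\bar{P} - P_e)/(1 - P_e)$ form.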
A clever solution, but not one that I've ever seen used in an article.
References
Kraemer, H. C. (1980). Extension of the kappa coefficient. Biometrics, 36(2), 207–216.