Solved – Computing Cohen’s Kappa variance (and standard errors)

Tags: cohens-kappa, estimation, reliability, variance

The Kappa ($\kappa$) statistic was introduced in 1960 by Cohen [1] to measure agreement between two raters. Its variance, however, had been a source of contradictions for quite some time.

My question is about which variance calculation is best to use with large samples. I am inclined to believe the one tested and verified by Fleiss [2] would be the right choice, but it does not seem to be the only published one that appears to be correct (and it is the one used throughout fairly recent literature).

Right now I have two concrete ways to compute its asymptotic large sample variance:

  • The corrected method published by Fleiss, Cohen and Everitt [2];
  • The delta method, which can be found in the book by Congalton, 2009 [4] (page 106).

To illustrate some of this confusion, here is a quote by Fleiss, Cohen and Everitt [2], emphasis mine:

Many human endeavors have been cursed with repeated failures before
final success is achieved. The scaling of Mount Everest is one
example. The discovery of the Northwest Passage is a second. The
derivation of a correct standard error for kappa is a third.

So, here is a small summary of what happened:

  • 1960: Cohen publishes his paper "A coefficient of agreement for nominal scales" [1] introducing his chance-corrected measure of agreement between two raters called $\kappa$. However, he publishes incorrect formulas for the variance calculations.
  • 1968: Everitt attempts to correct them, but his formulas were incorrect as well.
  • 1969: Fleiss, Cohen and Everitt publish the correct formulas in the paper "Large Sample Standard Errors Of Kappa and Weighted Kappa" [2].
  • 1971: Fleiss publishes a different statistic, also called $\kappa$ (now known as Fleiss' kappa), with incorrect formulas for its variance.
  • 1979: Fleiss, Nee and Landis publish the corrected formulas for Fleiss' $\kappa$.

First, consider the following notation. A dot subscript indicates that the summation operator is applied over the index in that position:

$\ \ \ p_{i.} = \displaystyle\sum_{j=1}^{k} p_{ij}$
$\ \ \ p_{.j} = \displaystyle\sum_{i=1}^{k} p_{ij}$

Now, one can compute Kappa as:

$\ \ \ \hat\kappa = \displaystyle\frac{p_o-p_c}{1-p_c}$

In which

$\ \ \ p_o = \displaystyle\sum_{i=1}^{k} p_{ii}$ is the observed agreement, and

$\ \ \ p_c = \displaystyle\sum_{i=1}^{k} p_{i.} p_{.i}$ is the chance agreement.
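
To make the notation concrete, here is a small R sketch that computes $p_o$, $p_c$ and $\hat\kappa$; the 3x3 count matrix and all variable names are made up by me for illustration:

# Hypothetical 3x3 confusion matrix of counts for two raters
n_mat <- matrix(c(20,  5,  3,
                   4, 15,  6,
                   2,  3, 12), nrow = 3, byrow = TRUE)

p     <- n_mat / sum(n_mat)   # p_ij: cell proportions
p_row <- rowSums(p)           # p_i. (row marginals)
p_col <- colSums(p)           # p_.j (column marginals)

p_o <- sum(diag(p))           # observed agreement
p_c <- sum(p_row * p_col)     # chance agreement
kappa_hat <- (p_o - p_c) / (1 - p_c)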

So far, the correct variance calculation for Cohen's $\kappa$ is given by:

$\ \ \ \newcommand{\var}{\mathrm{var}}\widehat{\var}(\hat{\kappa}) = \frac{1}{N(1-p_c)^4} \{ \displaystyle\sum_{i=1}^{k} p_{ii}[(1-p_c) - (p_{.i} + p_{i.})(1-p_o)]^2 \\
\ \ \ + (1-p_o)^2 \displaystyle\sum_{i=1}^{k} \displaystyle\sum_{j=1, i\not=j}^{k} p_{ij} (p_{.i} + p_{j.})^2 - (p_op_c-2p_c+p_o)^2 \} $
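
To check my understanding of this formula, here is how I would transcribe it in R (my own sketch, not verified against the paper; p, p_row and p_col are as in the snippet above, N is the number of rated subjects, and fleiss_var is just a name I made up):

fleiss_var <- function(p, N) {
  k     <- nrow(p)
  p_row <- rowSums(p)           # p_i.
  p_col <- colSums(p)           # p_.j
  p_o   <- sum(diag(p))
  p_c   <- sum(p_row * p_col)

  # first sum: diagonal cells
  term1 <- sum(diag(p) * ((1 - p_c) - (p_col + p_row) * (1 - p_o))^2)

  # second sum: off-diagonal cells, weighted by (p_.i + p_j.)^2
  term2 <- 0
  for (i in 1:k) for (j in 1:k) if (i != j)
    term2 <- term2 + p[i, j] * (p_col[i] + p_row[j])^2
  term2 <- (1 - p_o)^2 * term2

  term3 <- (p_o * p_c - 2 * p_c + p_o)^2

  (term1 + term2 - term3) / (N * (1 - p_c)^4)
}

An approximate 95% confidence interval would then be kappa_hat ± 1.96 * sqrt(fleiss_var(p, sum(n_mat))).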

and under the null hypothesis it is given by:

$\ \ \ \widehat{\var}(\hat{\kappa}) = \frac{1}{N(1-p_c)^2} \{ \displaystyle\sum_{i=1}^{k} p_{.i}p_{i.} [1- (p_{.i} + p_{i.})]^2 + \displaystyle\sum_{i=1}^{k} \displaystyle\sum_{j=1, i\not=j}^{k} p_{i.}p_{.j}(p_{.i} + p_{j.})^2 - p_c^2 \} $
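
The same transcription exercise for the null-hypothesis variance (again my own sketch under the same notation; fleiss_var_null is a name of my choosing):

fleiss_var_null <- function(p, N) {
  k     <- nrow(p)
  p_row <- rowSums(p)           # p_i.
  p_col <- colSums(p)           # p_.j
  p_c   <- sum(p_row * p_col)

  # diagonal-like term based on the marginals only
  term1 <- sum(p_col * p_row * (1 - (p_col + p_row))^2)

  # off-diagonal term: p_i. p_.j weighted by (p_.i + p_j.)^2
  term2 <- 0
  for (i in 1:k) for (j in 1:k) if (i != j)
    term2 <- term2 + p_row[i] * p_col[j] * (p_col[i] + p_row[j])^2

  (term1 + term2 - p_c^2) / (N * (1 - p_c)^2)
}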

Congalton's method seems to be based on the delta method for obtaining variances (Agresti, 1990; Agresti, 2002 [3]); however, I am not sure what the delta method is or why it has to be used. The $\kappa$ variance, under this method, is given by:

$\ \ \ \widehat{\var}(\hat{\kappa}) = \frac{1}{n} \{ \frac{\theta_1 (1-\theta_1)}{(1-\theta_2)^2} + \frac{2(1-\theta_1)(2\theta_1\theta_2-\theta_3)}{(1-\theta_2)^3} + \frac{(1-\theta_1)^2(\theta_4-4\theta_2^2)}{(1-\theta_2)^4} \} $

in which

$\ \ \ \theta_1 = \frac{1}{n} \displaystyle\sum_{i=1}^{k} n_{ii}$

$\ \ \ \theta_2 = \frac{1}{n^2} \displaystyle\sum_{i=1}^{k} n_{i+}n_{+i}$

$\ \ \ \theta_3 = \frac{1}{n^2} \displaystyle\sum_{i=1}^{k} n_{ii}(n_{i+} + n_{+i})$

$\ \ \ \theta_4 = \frac{1}{n^3} \displaystyle\sum_{i=1}^{k} \displaystyle\sum_{j=1}^{k} n_{ij}(n_{j+} + n_{+i})^2$

(Congalton uses a $+$ subscript rather than a $.$, but it seems to mean the same thing. In addition, I am supposing that $n_{ij}$ is a count matrix, i.e. the confusion matrix before it is divided by the total number of samples $n$, so that $p_{ij} = \frac{n_{ij}}{n}$.)
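
For comparison, here is my reading of this delta-method formula as an R sketch (again my own transcription from the formulas above; congalton_var is a made-up name, and the function takes the count matrix n_mat directly):

congalton_var <- function(n_mat) {
  n     <- sum(n_mat)
  n_row <- rowSums(n_mat)       # n_{i+}
  n_col <- colSums(n_mat)       # n_{+j}

  theta1 <- sum(diag(n_mat)) / n
  theta2 <- sum(n_row * n_col) / n^2
  theta3 <- sum(diag(n_mat) * (n_row + n_col)) / n^2
  # element (i, j) gets weight (n_{j+} + n_{+i})^2
  theta4 <- sum(n_mat * outer(n_col, n_row, "+")^2) / n^3

  (1 / n) * (
    theta1 * (1 - theta1) / (1 - theta2)^2 +
    2 * (1 - theta1) * (2 * theta1 * theta2 - theta3) / (1 - theta2)^3 +
    (1 - theta1)^2 * (theta4 - 4 * theta2^2) / (1 - theta2)^4
  )
}

Comparing sqrt(congalton_var(n_mat)) with sqrt(fleiss_var(n_mat / sum(n_mat), sum(n_mat))) on the mock matrix above makes it easy to see numerically how far the two approaches differ.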

Another odd thing is that Congalton's book refers to the original paper by Cohen, but does not seem to cite the corrections to the Kappa variance published by Fleiss et al. until it goes on to discuss weighted Kappa. Perhaps his first publication was written when the true formula for kappa was still lost in confusion?

Can somebody explain the reason for these differences? Or why would someone use the delta-method variance instead of the corrected version by Fleiss?

[1]: Cohen, Jacob (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46. doi:10.1177/001316446002000104

[2]: Fleiss, Joseph L.; Cohen, Jacob; Everitt, B. S. (1969). Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72(5), 323-327. doi:10.1037/h0028106

[3]: Agresti, Alan (2002). Categorical Data Analysis, 2nd edition. John Wiley and Sons.

[4]: Congalton, Russell G.; Green, K. (2009). Assessing the Accuracy of Remotely Sensed Data: Principles and Practices, 2nd edition.

Best Answer

I don't know which of the two ways of calculating the variance is preferable, but I can give you a third, practical and useful way to calculate confidence/credible intervals: Bayesian estimation of Cohen's Kappa.

The R and JAGS code below generates MCMC samples from the posterior distribution of the credible values of Kappa given the data.

library(rjags)
library(coda)
library(psych)

# Creating some mock data
rater1 <- c(1, 2, 3, 1, 1, 2, 1, 1, 3, 1, 2, 3, 3, 2, 3) 
rater2 <- c(1, 2, 2, 1, 2, 2, 3, 1, 3, 1, 2, 3, 2, 1, 1) 
agreement <- as.numeric(rater1 == rater2)  # 0/1 indicator used by dbern below
n_categories <- 3
n_ratings <- 15

# The JAGS model definition, should work in WinBugs with minimal modification
kohen_model_string <- "model {
  kappa <- (p_agreement - chance_agreement) / (1 - chance_agreement)
  chance_agreement <- sum(p1 * p2)

  for(i in 1:n_ratings) {
    rater1[i] ~ dcat(p1)
    rater2[i] ~ dcat(p2)
    agreement[i] ~ dbern(p_agreement)
  }

  # Uniform priors on all parameters
  p1 ~ ddirch(alpha)
  p2 ~ ddirch(alpha)
  p_agreement ~ dbeta(1, 1)
  for(cat_i in 1:n_categories) {
    alpha[cat_i] <- 1
  }
}"

# Running the model
kohen_model <- jags.model(file = textConnection(kohen_model_string),
                 data = list(rater1 = rater1, rater2 = rater2,
                   agreement = agreement, n_categories = n_categories,
                   n_ratings = n_ratings),
                 n.chains= 1, n.adapt= 1000)

update(kohen_model, 10000)
mcmc_samples <- coda.samples(kohen_model, variable.names="kappa", n.iter=20000)

The plot below shows a density plot of the MCMC samples from the posterior distribution of Kappa.

Posterior Kappa density

Using the MCMC samples, we can now take the median as a point estimate of Kappa and the 2.5% and 97.5% quantiles as a 95% confidence/credible interval.

summary(mcmc_samples)$quantiles
##      2.5%        25%        50%        75%      97.5% 
## 0.01688361 0.26103573 0.38753814 0.50757431 0.70288890 

Compare this with the "classical" estimates calculated according to Fleiss, Cohen and Everitt:

cohen.kappa(cbind(rater1, rater2), alpha=0.05)
##                  lower estimate upper
## unweighted kappa  0.041     0.40  0.76

Personally, I would prefer the Bayesian credible interval over the classical confidence interval, especially since I believe the Bayesian interval has better small-sample properties. A common concern people tend to have with Bayesian analyses is that you have to specify prior beliefs regarding the distributions of the parameters. Fortunately, in this case, it is easy to construct "objective" priors by simply putting uniform distributions over all the parameters. This should make the outcome of the Bayesian model very similar to a "classical" calculation of the Kappa coefficient.

References

Sanjib Basu, Mousumi Banerjee and Ananda Sen (2000). Bayesian Inference for Kappa from Single and Multiple Studies. Biometrics, Vol. 56, No. 2 (Jun., 2000), pp. 577-582
