Solved – Why is Cohen’s kappa low despite high observed agreement

agreement-statistics, cohens-kappa, reliability

I have about 10 raters giving weighted ratings to different items of a dataset. The number of raters may differ from one item to another: an item may have any number of raters between 2 and 10. Raters can give each item one of 3 values (-1, 0, and 1).

I tried using weighted kappa to measure the agreement between each pair of raters at a time. However, the results I am getting seem a little strange: when only 2 of the 3 values are used by the raters, the result is 0 even if they agree on most items. Example:

(-1, -1; -1, -1; -1, -1; -1, -1; -1, 0; -1, 0)

Even though the two raters gave the same rating to 4 of the 6 items, the result is 0.

Is Cohen's kappa the best-suited method for my problem? If not, what's a better alternative?

Best Answer

In your example, Cohen's $\kappa$ coefficient is equal to $0$ despite observed agreement ($p_o$) being relatively high because chance agreement ($p_c$) is also high under Cohen's assumptions. Cohen's $\kappa$ estimates chance agreement from each rater's marginal proportions; since the first rater gave $-1$ to every item, $p_c$ reduces to the second rater's proportion of $-1$ ratings, $4/6 = .667$, which exactly matches $p_o$.

$$ \kappa = \frac{p_o - p_c}{1 - p_c} = \frac{.667 - .667}{1 - .667} = .000 $$
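For illustration, here is a minimal sketch in Python that reproduces these numbers from the six rating pairs in the question; the optional cross-check against scikit-learn's `cohen_kappa_score` is just one way to verify the hand computation.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Six items: rater A always gives -1, rater B agrees on 4 of 6
rater_a = np.array([-1, -1, -1, -1, -1, -1])
rater_b = np.array([-1, -1, -1, -1,  0,  0])

categories = [-1, 0, 1]

# Observed agreement: proportion of items with identical ratings
p_o = np.mean(rater_a == rater_b)                       # 0.667

# Chance agreement under Cohen's assumptions: product of the two
# raters' marginal proportions, summed over categories
p_c = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories)  # 0.667

kappa = (p_o - p_c) / (1 - p_c)                         # 0.0
print(round(p_o, 3), round(p_c, 3), round(kappa, 3))

# Cross-check with scikit-learn
print(cohen_kappa_score(rater_a, rater_b))              # 0.0
```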

You might try another chance-adjusted index that makes different assumptions than Cohen's $\kappa$ coefficient. One such option would be Bennett et al.'s $S$ score below, where $q$ is the number of possible categories. In this example, assuming the same three category options, $S$ would be higher.

$$ S = \frac{p_o - 1/q}{1 - 1/q} = \frac{.667 - .333}{1 - .333} = .500 $$
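Continuing the same sketch, $S$ needs only the observed agreement and the number of possible categories $q$, since it assumes uniform chance agreement of $1/q$:

```python
import numpy as np

rater_a = np.array([-1, -1, -1, -1, -1, -1])
rater_b = np.array([-1, -1, -1, -1,  0,  0])

p_o = np.mean(rater_a == rater_b)     # observed agreement, 0.667
q = 3                                 # possible categories: -1, 0, 1

# Bennett et al.'s S replaces Cohen's marginal-based p_c with 1/q
s_score = (p_o - 1 / q) / (1 - 1 / q)
print(round(s_score, 3))              # 0.5
```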

Both of these metrics (and several others) can be adapted to multiple raters, multiple categories, missing data, and weighting schemes for non-nominal categories. See my mReliability website.
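As a rough illustration of the multi-rater, missing-data case (a sketch of one common generalization, not the mReliability formulas themselves), observed agreement can be taken as the mean pairwise agreement among whichever raters rated each item, and then adjusted with the uniform $1/q$ chance term as above. The `multi_rater_s` helper and the example data below are hypothetical.

```python
from itertools import combinations
import numpy as np

def multi_rater_s(ratings, q):
    """Bennett-style S for a variable number of raters per item.

    `ratings` is a list of lists; each inner list holds the ratings the
    available raters gave to one item (missing raters are simply omitted).
    Observed agreement is the mean pairwise agreement within each item,
    averaged over items; chance agreement is the uniform 1/q.
    """
    item_agreements = []
    for item in ratings:
        pairs = list(combinations(item, 2))
        if not pairs:        # fewer than 2 raters: no agreement information
            continue
        item_agreements.append(np.mean([a == b for a, b in pairs]))
    p_o = np.mean(item_agreements)
    return (p_o - 1 / q) / (1 - 1 / q)

# Example: items rated by 2 to 4 raters, categories -1/0/1
data = [[-1, -1], [-1, -1, 0], [0, 0, 0, 1], [1, -1]]
print(round(multi_rater_s(data, q=3), 3))
```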