Reliability – Can Cohen’s Kappa Be Used for Two Judgements Only?

information retrieval, reliability

I am using Cohen's Kappa to calculate the inter-rater agreement between two judges.

It is calculated as:

$ \kappa = \frac{P(A) - P(E)}{1 - P(E)} $

where $P(A)$ is the proportion of agreement and $P(E)$ the probability of agreement by chance.

Now for the following dataset, I get the expected results:

User A judgements: 
  - 1, true
  - 2, false
User B judgements: 
  - 1, false
  - 2, false
Proportion agreed: 0.5
Agreement by chance: 0.625
Kappa for User A and B: -0.3333333333333333
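
For reference, here is a minimal Python sketch of the calculation described above (the helper name `kappa_pooled` is just illustrative). Note that the chance term is estimated from the pooled category counts of both raters, which is what reproduces the 0.625 quoted above:

```python
from collections import Counter

def kappa_pooled(a, b):
    """Chance-corrected agreement for two equally long lists of judgements.
    The chance term uses the pooled category counts of both raters."""
    n = len(a)
    p_agree = sum(x == y for x, y in zip(a, b)) / n
    pooled = Counter(a) + Counter(b)
    p_chance = sum((count / (2 * n)) ** 2 for count in pooled.values())
    if p_chance == 1:
        # Only one category was ever chosen: (P(A) - P(E)) / (1 - P(E)) would be 0/0.
        return 0.0
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa_pooled([True, False], [False, False]))  # -0.3333333333333333
```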

We can see that the two judges did not agree very well. However, in the following case, where both judges evaluate only a single criterion, kappa evaluates to zero:

User A judgements: 
  - 1, false
User B judgements: 
  - 1, false
Proportion agreed: 1.0
Agreement by chance: 1.0
Kappa for User A and B: 0

Now I can see that the agreement by chance is obviously 1, which leads to kappa being zero, but does this count as a reliable result? The problem is that I normally don't have more than two judgements per criterion, so none of these will ever evaluate to a kappa greater than 0, which I think is not very representative.

Are my calculations correct? Can I use a different method to calculate inter-rater agreement?

Here we can see that kappa works fine for multiple judgements:

User A judgements: 
  - 1, false
  - 2, true
  - 3, false
  - 4, false
  - 5, true
User B judgements: 
  - 1, true
  - 2, true
  - 3, false
  - 4, true
  - 5, false
Proportion agreed: 0.4
Agreement by chance: 0.5
Kappa for User A and B: -0.19999999999999996
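
(For what it is worth, the sketch above gives the same figure here: `kappa_pooled([False, True, False, False, True], [True, True, False, True, False])` evaluates to -0.19999999999999996, while the single-judgement case `kappa_pooled([False], [False])` hits the degenerate branch and returns 0.)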

Best Answer

The "chance correction" in Cohen's $\kappa$ estimates probabilities with which each rater chooses the existing categories. The estimation comes from the marginal frequencies of the categories. When you only have 1 judgement for each rater, this means that $\kappa$ assumes the category chosen for this single judgement in general has a probability of 1. This obviously makes no sense since the number of judgements (1) is too small to reliably estimate the base rates of all categories.

An alternative might be a simple binomial model: without additional information, we might assume that the probability of agreement between the two raters on any single criterion is 0.5, since the judgements are binary. This implicitly assumes that both raters pick each category with probability 0.5 for every criterion. The number of agreements expected by chance across all criteria then follows a binomial distribution with $p = 0.5$.
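
As a rough sketch of this idea (the helper below is illustrative, not a standard routine; the figures come from the five-criteria example in the question):

```python
from math import comb

def p_at_least(k_agreements, n_criteria, p_agree=0.5):
    """Probability of at least k chance agreements out of n binary criteria,
    when each criterion is agreed on independently with probability p_agree
    (0.5 if both raters pick each of the two categories at random)."""
    return sum(
        comb(n_criteria, i) * p_agree**i * (1 - p_agree) ** (n_criteria - i)
        for i in range(k_agreements, n_criteria + 1)
    )

# Five criteria, two observed agreements (the last example in the question):
print(p_at_least(2, 5))  # 0.8125, so this much agreement is easily produced by chance
```

Under this model, an observed agreement count is compared against a binomial tail instead of being chance-corrected through marginal frequencies, which sidesteps the degenerate single-judgement case above.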
