Solved – Fleiss kappa in R giving strange results

cohens-kappa, r

I have an experiment where 4 raters gave their responses to 4 stimuli, and I need to calculate Fleiss' kappa to check the agreement among the raters. However, I get strange results from the R function that implements the Fleiss analysis.

Participant1 <- c(16, 15, 16, 16)
Participant2 <- c(16, 16, 16, 16)
Participant3 <- c(16, 16, 16, 16)
Participant4 <- c(16, 16, 16, 15)
data <- data.frame(Participant1, Participant2, Participant3, Participant4)
data
library(irr)
kappam.fleiss(data)

The output is

> data
  Participant1 Participant2 Participant3 Participant4
1           16           16           16           16
2           15           16           16           16
3           16           16           16           16
4           16           16           16           15



> kappam.fleiss(data)
 Fleiss' Kappa for m Raters

 Subjects = 4 
  Raters = 4 
   Kappa = -0.143 

        z = -0.7 
  p-value = 0.484

The value for kappa is negative, with a non-significant p-value, despite clear agreement between the raters. Why?
Personally, I do not really get the answer to the similar question reported here: Strange values of Cohen's kappa

So, why is the Fleiss analysis useful? The results do not seem to indicate how much the raters agreed.

How can I simply calculate the agreement between the four raters?

Best Answer

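The problem is that there is almost no variation among the raters, and the tiny bit of variation that does exist is not in agreement. Only two ratings are not 16, and they are for different stimuli. As a result, the observed agreement (0.75, averaged over rater pairs) falls just below the agreement expected by chance when almost every rating is 16 (about 0.78), so you get a negative kappa. That's a correct result. You may, however, want a different measure.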
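To see where the negative value comes from, the calculation can be reproduced by hand. This is a minimal sketch in base R, assuming the usual Fleiss formula (mean pairwise agreement per stimulus versus chance agreement from the pooled category proportions). The variable names are my own and are not part of the irr package.

```r
# Ratings from the question: rows are stimuli, columns are raters
ratings <- cbind(
  Participant1 = c(16, 15, 16, 16),
  Participant2 = c(16, 16, 16, 16),
  Participant3 = c(16, 16, 16, 16),
  Participant4 = c(16, 16, 16, 15)
)
n_raters   <- ncol(ratings)
categories <- sort(unique(as.vector(ratings)))

# counts[i, j] = number of raters who gave stimulus i the rating categories[j]
counts <- t(apply(ratings, 1, function(row) table(factor(row, levels = categories))))

# Per-stimulus agreement: share of rater pairs that gave the same rating
p_i        <- (rowSums(counts^2) - n_raters) / (n_raters * (n_raters - 1))
p_observed <- mean(p_i)

# Chance agreement from the pooled proportion of each rating
p_j        <- colSums(counts) / sum(counts)
p_expected <- sum(p_j^2)

kappa <- (p_observed - p_expected) / (1 - p_expected)
print(c(p_observed = p_observed, p_expected = p_expected, kappa = kappa))
```

Here p_observed is 0.75 and p_expected is 0.78125, so kappa comes out at about -0.143, the same value kappam.fleiss reports. If all you want is a simple description of how often raters agree, p_observed (the share of rater pairs giving the same rating, averaged over stimuli) may be a more direct summary, though it does not correct for chance.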

Related Solutions

Solved – Inter-rater reliability with many non-overlapping raters

Check out Krippendorff's alpha. It has several advantages over some other measures such as Cohen's kappa, Fleiss' kappa, and Cronbach's alpha: it is robust to missing data (which I gather is your main concern); it can deal with more than 2 raters; it can handle different types of scales (nominal, ordinal, etc.); and it accounts for chance agreement better than some other measures like Cohen's kappa.

Calculation of Krippendorff's alpha is supported by several statistical software packages, including R (via the irr package), SPSS, etc.
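For illustration, here is a rough base-R sketch of the nominal version of alpha that tolerates missing ratings. It assumes the standard definition alpha = 1 - D_o / D_e built from within-unit pair counts; the function name kripp_alpha_nominal and the example data are hypothetical. In practice the irr package's kripp.alpha() is the packaged route (note that it expects raters in rows and units in columns).

```r
# ratings: units in rows, raters in columns, NA = rating not given
kripp_alpha_nominal <- function(ratings) {
  categories <- sort(unique(as.vector(ratings[!is.na(ratings)])))
  # counts[u, c] = number of raters who assigned category c to unit u
  counts <- t(apply(ratings, 1, function(row) table(factor(row, levels = categories))))
  m_u <- rowSums(counts)                       # ratings available per unit
  counts <- counts[m_u >= 2, , drop = FALSE]   # units with < 2 ratings carry no pair information
  m_u <- m_u[m_u >= 2]
  n   <- sum(m_u)                              # total number of pairable ratings
  n_c <- colSums(counts)
  # Observed disagreement: mismatched ordered pairs within units, weighted by 1 / (m_u - 1)
  mismatched <- m_u^2 - rowSums(counts^2)
  d_observed <- sum(mismatched / (m_u - 1)) / n
  # Expected disagreement: mismatched pairs when all ratings are pooled
  d_expected <- (n^2 - sum(n_c^2)) / (n * (n - 1))
  1 - d_observed / d_expected
}

# Hypothetical example: 5 units, 4 raters, several ratings missing
sparse_ratings <- rbind(
  c(1, 1, NA, 1),
  c(2, 2, 3, NA),
  c(NA, 3, 3, 3),
  c(1, NA, NA, 2),
  c(2, 2, 2, NA)
)
print(kripp_alpha_nominal(sparse_ratings))
```

The weighting by 1 / (m_u - 1) is what lets units with different numbers of ratings contribute on a comparable footing, which is why missing data is not a problem for this measure.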

Below are some relevant papers that discuss Krippendorff's alpha, including its properties and its implementation, and compare it with other measures:

Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77-89.
Krippendorff, K. (2004). Reliability in Content Analysis: Some Common Misconceptions and Recommendations. Human Communication Research, 30(3), 411-433. doi: 10.1111/j.1468-2958.2004.tb00738.x
Chapter 3 in Krippendorff, K. (2013). Content Analysis: An Introduction to Its Methodology (3rd ed.): Sage.

There are some additional technical papers on Krippendorff's website.

Solved – Computing Cohen’s Kappa variance (and standard errors)

I don't know which of the two ways to calculate the variance is preferable, but I can give you a third, practical and useful way to calculate confidence/credible intervals: Bayesian estimation of Cohen's Kappa.

The R and JAGS code below generates MCMC samples from the posterior distribution of the credible values of Kappa given the data.

library(rjags)
library(coda)
library(psych)

# Creating some mock data
rater1 <- c(1, 2, 3, 1, 1, 2, 1, 1, 3, 1, 2, 3, 3, 2, 3) 
rater2 <- c(1, 2, 2, 1, 2, 2, 3, 1, 3, 1, 2, 3, 2, 1, 1) 
agreement <- rater1 == rater2
n_categories <- 3
n_ratings <- 15

# The JAGS model definition, should work in WinBugs with minimal modification
kohen_model_string <- "model {
  kappa <- (p_agreement - chance_agreement) / (1 - chance_agreement)
  chance_agreement <- sum(p1 * p2)

  for(i in 1:n_ratings) {
    rater1[i] ~ dcat(p1)
    rater2[i] ~ dcat(p2)
    agreement[i] ~ dbern(p_agreement)
  }

  # Uniform priors on all parameters
  p1 ~ ddirch(alpha)
  p2 ~ ddirch(alpha)
  p_agreement ~ dbeta(1, 1)
  for(cat_i in 1:n_categories) {
    alpha[cat_i] <- 1
  }
}"

# Running the model
kohen_model <- jags.model(file = textConnection(kohen_model_string),
                 data = list(rater1 = rater1, rater2 = rater2,
                   agreement = agreement, n_categories = n_categories,
                   n_ratings = n_ratings),
                 n.chains= 1, n.adapt= 1000)

update(kohen_model, 10000)
mcmc_samples <- coda.samples(kohen_model, variable.names="kappa", n.iter=20000)

The plot below shows a density plot of the MCMC samples from the posterior distribution of Kappa.

[Figure: Posterior Kappa density]

Using the MCMC samples, we can now use the median value as an estimate of Kappa and the 2.5% and 97.5% quantiles as a 95% confidence/credible interval.

summary(mcmc_samples)$quantiles
##      2.5%        25%        50%        75%      97.5% 
## 0.01688361 0.26103573 0.38753814 0.50757431 0.70288890

Compare this with the "classical" estimates calculated according to Fleiss, Cohen and Everitt:

cohen.kappa(cbind(rater1, rater2), alpha=0.05)
##                  lower estimate upper
## unweighted kappa  0.041     0.40  0.76
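As a sanity check on the classical point estimate, the unweighted kappa can also be computed by hand from the cross-tabulation of the two raters. This is only a sketch in base R using the mock data above; the variable names are mine, and it does not reproduce the confidence limits that cohen.kappa reports.

```r
rater1 <- c(1, 2, 3, 1, 1, 2, 1, 1, 3, 1, 2, 3, 3, 2, 3)
rater2 <- c(1, 2, 2, 1, 2, 2, 3, 1, 3, 1, 2, 3, 2, 1, 1)

# Cross-tabulate the two raters over a shared set of categories
categories <- sort(unique(c(rater1, rater2)))
confusion  <- table(factor(rater1, levels = categories),
                    factor(rater2, levels = categories))

# Observed agreement: share of items on the diagonal
p_observed <- sum(diag(confusion)) / sum(confusion)

# Chance agreement: product of the two raters' marginal proportions, summed over categories
p_chance <- sum(rowSums(confusion) * colSums(confusion)) / sum(confusion)^2

kappa_hat <- (p_observed - p_chance) / (1 - p_chance)
print(kappa_hat)  # 0.4, in line with the cohen.kappa estimate
```

The raters agree on 9 of 15 items (0.6), chance agreement is 1/3, and so kappa is (0.6 - 1/3) / (1 - 1/3) = 0.4.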

Personally, I would prefer the Bayesian credible interval over the classical confidence interval, especially since I believe the Bayesian interval has better small-sample properties. A common concern people tend to have with Bayesian analyses is that you have to specify prior beliefs regarding the distributions of the parameters. Fortunately, in this case, it is easy to construct "objective" priors by simply putting uniform distributions over all the parameters. This should make the outcome of the Bayesian model very similar to a "classical" calculation of the Kappa coefficient.
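A side note for anyone without JAGS installed: in the model as written above, the uniform priors are conjugate and the three likelihood terms do not share parameters, so (as far as I can tell) the posterior factorizes into two Dirichlet distributions and one Beta distribution that can be sampled directly. The sketch below assumes exactly that model; rdirichlet_simple is a hypothetical helper of mine, not a function from rjags or coda.

```r
set.seed(1)
rater1 <- c(1, 2, 3, 1, 1, 2, 1, 1, 3, 1, 2, 3, 3, 2, 3)
rater2 <- c(1, 2, 2, 1, 2, 2, 3, 1, 3, 1, 2, 3, 2, 1, 1)
n_categories <- 3
n_draws <- 20000

# Hypothetical helper: Dirichlet draws via normalized Gamma variates (one draw per row)
rdirichlet_simple <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = rep(alpha, each = n)), nrow = n)
  g / rowSums(g)
}

# Posterior parameters: uniform prior (all ones) plus observed counts
counts1 <- tabulate(rater1, nbins = n_categories)
counts2 <- tabulate(rater2, nbins = n_categories)
n_agree <- sum(rater1 == rater2)
n_items <- length(rater1)

p1 <- rdirichlet_simple(n_draws, 1 + counts1)
p2 <- rdirichlet_simple(n_draws, 1 + counts2)
p_agreement <- rbeta(n_draws, 1 + n_agree, 1 + n_items - n_agree)

# Same deterministic transformation as in the JAGS model
chance_agreement <- rowSums(p1 * p2)
kappa_draws <- (p_agreement - chance_agreement) / (1 - chance_agreement)

print(quantile(kappa_draws, c(0.025, 0.5, 0.975)))
```

The quantiles should land close to the MCMC summary above, up to Monte Carlo error. This shortcut only works because of the particular priors and independence structure of this model; a model that ties the agreement probability to p1 and p2 would still need MCMC.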

References

Sanjib Basu, Mousumi Banerjee and Ananda Sen (2000). Bayesian Inference for Kappa from Single and Multiple Studies. Biometrics, Vol. 56, No. 2 (Jun., 2000), pp. 577-582
