Solved – Using Cohen’s kappa for multiple item types (e.g. binary and non-binary)

cohens-kappa, reliability

I am looking to calculate Cohen's kappa for multiple items in a scale (15 items, with two raters). The problem is that for 12 of the items the responses available to the raters were binary (yes, no). The remaining three are categorical, with more than two choices (e.g. cat1, cat2, cat3, cat4). Does it make sense to use the kappa statistic across both types of variables at the same time? Or should I calculate kappa separately for the binary variables and then again for the non-binary variables?

Thanks.

Best Answer

The standard form of kappa is for agreement between two categorical variables with the same number of categories (and indeed the same categories, so you know which category on variable A matches which on variable B).
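As a minimal sketch of that standard form, kappa can be computed in base R from the square cross-table of the two raters' codes. The function name `cohen_kappa` and the example ratings are hypothetical, made up for illustration only.

```r
# Cohen's kappa for two raters, computed from the definition.
# `levels` forces both raters onto the same category set, so the
# cross-table is square even if one rater never used a category.
cohen_kappa <- function(r1, r2, levels = sort(unique(c(r1, r2)))) {
  tab <- table(factor(r1, levels = levels), factor(r2, levels = levels))
  p <- tab / sum(tab)
  p_observed <- sum(diag(p))                 # proportion of exact agreement
  p_chance <- sum(rowSums(p) * colSums(p))   # agreement expected by chance
  (p_observed - p_chance) / (1 - p_chance)
}

# Hypothetical ratings on one binary item
rater_a <- c("yes", "yes", "no", "no", "yes", "no", "yes", "no")
rater_b <- c("yes", "no", "no", "no", "yes", "no", "yes", "yes")
kappa_ab <- cohen_kappa(rater_a, rater_b)
print(kappa_ab)
```

The same function applies unchanged to an item with four categories, as long as both raters chose from the same set of categories.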

Brennan and Light developed a method, in a paper entitled "Measuring agreement when two observers classify people into categories not defined in advance" (available here), which deals with the case where the two raters use different categories, possibly in different numbers.

Related Solutions

Solved – Assessing and testing inter-rater agreement with kappa statistic on a set of binary and Likert items

With regard to whether you should compute agreement for each item, this depends somewhat on how you plan to analyse the data.
- If you plan to compute scale scores (e.g., sum up the binary responses or sum up the Likert responses) to form a scale, then you could perform a reliability analysis on the scale scores. In this situation, you may be starting to have enough scale points to use other procedures for inter-rater reliability assessment that assume numeric data, such as the ICC. Your overall evaluation of reliability would then focus on the scale score. Reliability analysis of individual items might then just be used as a means of deciding which items to include in the composite scale (e.g., you could drop items with particularly low agreement).
- If you plan to report individual items, then you would want to report kappa for each item. You may still find it useful to summarise these individual kappas, in order to quickly communicate the general reliability of the items (e.g., report range, mean, and sd of kappa across items).
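A rough sketch of the per-item approach, in base R: compute kappa separately for each item, whatever its number of categories, and then summarise the resulting values. The data are simulated, and the helper `cohen_kappa`, the item names, and the 20% disagreement rate are all assumptions made for illustration.

```r
# Hypothetical helper: Cohen's kappa from the square cross-table
cohen_kappa <- function(r1, r2, levels = sort(unique(c(r1, r2)))) {
  p <- prop.table(table(factor(r1, levels = levels),
                        factor(r2, levels = levels)))
  p_chance <- sum(rowSums(p) * colSums(p))
  (sum(diag(p)) - p_chance) / (1 - p_chance)
}

set.seed(1)
n_subjects <- 40
# Simulated ratings: two binary items and one four-category item
rater1 <- data.frame(
  item1 = sample(c("yes", "no"), n_subjects, replace = TRUE),
  item2 = sample(c("yes", "no"), n_subjects, replace = TRUE),
  item3 = sample(paste0("cat", 1:4), n_subjects, replace = TRUE),
  stringsAsFactors = FALSE
)
# Rater 2 copies rater 1, then re-draws roughly 20% of ratings at random
rater2 <- rater1
for (item in names(rater2)) {
  flip <- runif(n_subjects) < 0.2
  rater2[[item]][flip] <- sample(unique(rater1[[item]]), sum(flip),
                                 replace = TRUE)
}

# One kappa per item, then a summary across items
item_kappas <- sapply(names(rater1), function(item) {
  cohen_kappa(rater1[[item]], rater2[[item]])
})
kappa_summary <- c(min = min(item_kappas), max = max(item_kappas),
                   mean = mean(item_kappas), sd = sd(item_kappas))
print(round(item_kappas, 2))
print(round(kappa_summary, 2))
```

Because each item gets its own cross-table, binary and multi-category items never have to share a category set.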
If you don't like the Kappa values that you are getting, this is not a reason not to use Kappa (apologies for the triple negative).
- It may be that your rules of thumb for interpreting Kappa are inappropriate.
- Alternatively, it may be that items are just not that reliable (high percentages of agreement can be obtained when variables are skewed even when the two raters disagree on which cases are in the minority category). In general, individual items are going to be less reliable than composite scales; also some binary evaluations are quite clear (e.g., gender), but in other cases where a judge is being asked whether an object passes over some threshold, ratings might be more reliable if they were asked to rate on a continuum.
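The point about skewed variables can be illustrated with a small made-up example in base R: both raters put 5 of 100 cases in the minority category, but never the same cases. The data and variable names are hypothetical.

```r
# 100 cases; each rater says "yes" for 5 cases, but for different ones
r1 <- rep("no", 100); r1[1:5]  <- "yes"
r2 <- rep("no", 100); r2[6:10] <- "yes"

lv <- c("no", "yes")
p <- prop.table(table(factor(r1, levels = lv), factor(r2, levels = lv)))

percent_agreement <- sum(diag(p))            # 90 of 100 cases match
p_chance <- sum(rowSums(p) * colSums(p))     # 0.95^2 + 0.05^2
kappa_value <- (percent_agreement - p_chance) / (1 - p_chance)

print(c(percent_agreement = percent_agreement, kappa = kappa_value))
```

Percent agreement is 0.90, yet kappa is slightly below zero, because with such skewed marginals two raters would agree on about 90.5% of cases by chance alone.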
You can use an ordinal (weighted) kappa on Likert items. @chl has an excellent discussion of the issues and alternatives here.
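One common form of ordinal kappa is weighted kappa, where disagreements are penalised by how far apart the two ratings are. The base R sketch below uses hypothetical names (`weighted_kappa`, `power`) and made-up 5-point ratings; quadratic weights (`power = 2`) are an assumption here, not something the answer above prescribes.

```r
# Weighted kappa for ordinal ratings coded 1..n_levels.
# power = 1 gives linear weights, power = 2 quadratic weights.
weighted_kappa <- function(r1, r2, n_levels, power = 2) {
  tab <- table(factor(r1, levels = 1:n_levels),
               factor(r2, levels = 1:n_levels))
  observed <- tab / sum(tab)
  expected <- outer(rowSums(observed), colSums(observed))
  # Disagreement weights: 0 on the diagonal, growing with distance
  disagreement <- abs(outer(1:n_levels, 1:n_levels, "-"))^power
  1 - sum(disagreement * observed) / sum(disagreement * expected)
}

# Made-up 5-point ratings: one near miss versus one far miss
reference <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5)
near_miss <- c(2, 2, 3, 4, 5, 1, 2, 3, 4, 5)  # first rating off by 1
far_miss  <- c(5, 2, 3, 4, 5, 1, 2, 3, 4, 5)  # first rating off by 4

kappa_near <- weighted_kappa(reference, near_miss, n_levels = 5)
kappa_far  <- weighted_kappa(reference, far_miss, n_levels = 5)
print(c(near = kappa_near, far = kappa_far))
```

Both comparisons contain exactly one disagreement, so unweighted percent agreement is identical, but the weighted statistic penalises the far miss much more heavily.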

Solved – Computing Cohen’s Kappa variance (and standard errors)

I don't know which of the two ways of calculating the variance is preferable, but I can give you a third, practical way to calculate confidence/credible intervals: Bayesian estimation of Cohen's Kappa.

The R and JAGS code below generates MCMC samples from the posterior distribution of the credible values of Kappa given the data.

library(rjags)
library(coda)
library(psych)

# Creating some mock data
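rater1 <- c(1, 2, 3, 1, 1, 2, 1, 1, 3, 1, 2, 3, 3, 2, 3)
rater2 <- c(1, 2, 2, 1, 2, 2, 3, 1, 3, 1, 2, 3, 2, 1, 1)
# 1 where the raters agree, 0 otherwise (stored as integers so JAGS receives numeric data)
agreement <- as.integer(rater1 == rater2)
n_categories <- 3
n_ratings <- length(rater1)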

# The JAGS model definition; should work in WinBUGS with minimal modification
kohen_model_string <- "model {
  kappa <- (p_agreement - chance_agreement) / (1 - chance_agreement)
  chance_agreement <- sum(p1 * p2)

  for(i in 1:n_ratings) {
    rater1[i] ~ dcat(p1)
    rater2[i] ~ dcat(p2)
    agreement[i] ~ dbern(p_agreement)
  }

  # Uniform priors on all parameters
  p1 ~ ddirch(alpha)
  p2 ~ ddirch(alpha)
  p_agreement ~ dbeta(1, 1)
  for(cat_i in 1:n_categories) {
    alpha[cat_i] <- 1
  }
}"

# Running the model
kohen_model <- jags.model(file = textConnection(kohen_model_string),
                 data = list(rater1 = rater1, rater2 = rater2,
                   agreement = agreement, n_categories = n_categories,
                   n_ratings = n_ratings),
                 n.chains= 1, n.adapt= 1000)

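# Burn-in: run 10000 iterations and discard them
update(kohen_model, 10000)
# Draw 20000 samples from the posterior distribution of kappa
mcmc_samples <- coda.samples(kohen_model, variable.names = "kappa", n.iter = 20000)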

The plot below shows a density plot of the MCMC samples from the posterior distribution of Kappa.

[Figure: posterior Kappa density]

Using the MCMC samples, we can now take the median as a point estimate of Kappa and the 2.5% and 97.5% quantiles as a 95% confidence/credible interval.

summary(mcmc_samples)$quantiles
##      2.5%        25%        50%        75%      97.5% 
## 0.01688361 0.26103573 0.38753814 0.50757431 0.70288890

Compare this with the "classical" estimates calculated according to Fleiss, Cohen and Everitt:

cohen.kappa(cbind(rater1, rater2), alpha=0.05)
##                  lower estimate upper
## unweighted kappa  0.041     0.40  0.76

Personally, I would prefer the Bayesian confidence interval over the classical confidence interval, especially since I believe the Bayesian confidence interval has better small-sample properties. A common concern people have with Bayesian analyses is that you have to specify prior beliefs regarding the distributions of the parameters. Fortunately, in this case, it is easy to construct "objective" priors by simply putting uniform distributions over all the parameters. This should make the outcome of the Bayesian model very similar to a "classical" calculation of the Kappa coefficient.

References

Basu, S., Banerjee, M. and Sen, A. (2000). Bayesian Inference for Kappa from Single and Multiple Studies. Biometrics, 56(2), 577-582.
