Solved – Inconsistency of findings between Kappa and ICC for IRR study

agreement-statistics, cohens-kappa, intraclass-correlation, reliability

I used the following R script to calculate the Kappa and ICC for a measurement done by two raters on 23 subjects:

library(irr)
temp <- structure(list(value.x = c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 2, 10, 10, 2), 
                       value.y = c(8, 8, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 8, 10, 10, 10, 10, 5, 10, 8, 5)),
                  .Names = c("value.x", "value.y"), row.names = c(NA, -23L), class = "data.frame")

kappa2(temp[,c("value.x","value.y")],weight="squared")

 Cohen's Kappa for 2 Raters (Weights: squared)

 Subjects = 23 
   Raters = 2 
    Kappa = 0.121 

        z = 1.61 
  p-value = 0.106 

icc(temp[,c("value.x","value.y")],model="twoway",type="agreement",unit="average")

 Average Score Intraclass Correlation

   Model: twoway 
   Type : agreement 

   Subjects = 23 
     Raters = 2 
   ICC(A,2) = 0.892

 F-Test, H0: r0 = 0 ; H1: r0 > 0 
 F(22,22.2) = 8.98 , p = 1.27e-06 

 95%-Confidence Interval for ICC Population Values:
  0.746 < ICC < 0.954

The measurement is actually on a 5-point scale whose possible values are c(0, 2, 5, 8, 10).

I am wondering why Kappa and the ICC give such different results: Kappa suggests poor agreement, while the ICC suggests excellent agreement.

Which one should I believe?

Best Answer

I think a plot of your data is revealing:

value.x = c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
            10, 10, 10, 10, 10, 10, 2, 10, 10, 2)
value.y = c(8, 8, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
            8, 10, 10, 10, 10, 5, 10, 8, 5)
plot(jitter(value.x), jitter(value.y))

The ICC and kappa are differently influenced by the bimodal nature of both sets of ratings, particularly those of rater x, who rated every subject either 2 or 10.
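A quick cross-tabulation of the two vectors defined above makes that clustering explicit:

# Cross-tabulate the two raters' scores: rater x uses only 2 and 10,
# while rater y spreads over 5, 8 and 10.
table(value.x, value.y)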

If "low" vs. "high" is all you are concerned about, there is nearly perfect agreement. But if the fact that x gave some people 10 who y gave 8, and x gave some people 2 who y gave 5 is of interest, then the agreement is not so good.

This can be seen even more clearly in a plot restricted to the high ratings:

ratings <- data.frame(value.x, value.y)

# keep only the subjects both raters scored in the high range
high <- subset(ratings, value.x > 2 & value.y > 2)

with(high, plot(jitter(value.x), jitter(value.y)))

Among the high ratings, the picture is pretty "blobby".
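And, going the other way, a sketch of what kappa sees inside that blob: restricted to the high ratings, rater x no longer discriminates at all (every remaining score is 10), so chance-corrected agreement collapses even though the raw scores sit close together.

# Kappa within the high ratings: rater x is constant at 10 here, so
# observed agreement equals chance agreement and kappa gets no credit.
kappa2(high)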