I used the following R script to calculate the Kappa and ICC for a measurement made by two raters on 23 subjects:
library(irr)
temp <- structure(list(value.x = c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 2, 10, 10, 2),
value.y = c(8, 8, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 8, 10, 10, 10, 10, 5, 10, 8, 5)),
.Names = c("value.x", "value.y"), row.names = c(NA, -23L), class = "data.frame")
kappa2(temp[,c("value.x","value.y")],weight="squared")
Cohen's Kappa for 2 Raters (Weights: squared)
Subjects = 23
Raters = 2
Kappa = 0.121
z = 1.61
p-value = 0.106
icc(temp[,c("value.x","value.y")],model="twoway",type="agreement",unit="average")
Average Score Intraclass Correlation
Model: twoway
Type : agreement
Subjects = 23
Raters = 2
ICC(A,2) = 0.892
F-Test, H0: r0 = 0 ; H1: r0 > 0
F(22,22.2) = 8.98 , p = 1.27e-06
95%-Confidence Interval for ICC Population Values:
0.746 < ICC < 0.954
The measurement is actually on a 5-point scale whose possible values are c(0, 2, 5, 8, 10).
I am wondering why Kappa and ICC give such different results: Kappa suggests that there is poor agreement, but ICC suggests that there is excellent agreement.
Which one should I believe?
Best Answer
I think a plot of your data is revealing:
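Something along these lines reproduces it from the temp data frame in the question (the jitter amount is an arbitrary choice, just to keep the many tied points from overplotting):
# Scatter plot of the two raters' scores, with jitter so tied points are visible
plot(jitter(temp$value.x, amount = 0.2), jitter(temp$value.y, amount = 0.2),
     xlab = "Rater x", ylab = "Rater y", xlim = c(0, 10), ylim = c(0, 10))
abline(0, 1, lty = 2)  # line of perfect agreement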
The ICC and kappa are going to be differently influenced by the bimodal nature of both ratings, particularly for rater x, who rated everything either 2 or 10.
If "low" vs. "high" is all you are concerned about, there is nearly perfect agreement. But if the fact that x gave some people 10 who y gave 8, and x gave some people 2 who y gave 5 is of interest, then the agreement is not so good.
This could be seen even more clearly in a plot of just the values over 5:
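For example (keeping only the subjects that both raters scored above 5; again the jitter is just for visibility):
high <- temp$value.x > 5 & temp$value.y > 5
plot(jitter(temp$value.x[high], amount = 0.2), jitter(temp$value.y[high], amount = 0.2),
     xlab = "Rater x", ylab = "Rater y")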
Among the high ratings, it's pretty "blobby".