ICC and Kappa – Understanding ICC and Kappa Disagreements

agreement-statistics, cohens-kappa, interpretation, intraclass-correlation, small-sample

Much has been written on the ICC and Kappa, but there seems to be disagreement on the best measures to consider.

My purpose is to identify a measure that shows whether there was agreement between respondents to an interviewer-administered questionnaire. 17 people gave ratings of 0-5 to a defined list of items, rating them according to importance (NOT ranking them).

I am not interested in whether the 17 participants all gave exactly the same rating, but only in whether there is agreement that an item should be rated high or not.

Following suggestions here, I have used both the ICC and Kappa, but they produced different results, as follows:

Kappa results

ICC results

I also note that, given the very small sample, the validity of the ICC could be questionable due to the use of the F test (see this question).

What are your suggestions, and what is the way forward for commenting on this?

Best Answer

The issues are much better explained in chl's answer to Inter-rater reliability for ordinal or interval data.

Here are some observations, based on a quick perusal of Wikipedia:

  • Cohen's Kappa and the Intra-Class Correlation measure different things and are only asymptotically equivalent (and then only in certain cases), so there is no reason to expect them to give you the same number in this case.
  • The statistical tests compare the values of these two statistics to a null hypothesis of zero, ie completely random ratings as far as inter-rater agreement goes. This is presumably an uninteresting null hypothesis anyway (it would be a very sad test that failed to knock out that null hypothesis!), so I don't see why you'd worry too much about the exact shape of the distribution of the F statistic under it.
  • From what I read, the actual interpretation of these statistics (what is a "good" level of agreement between raters, once we're sure that at least it's not zero) is arbitrary and based on judgement and subject-matter knowledge rather than a statistical test.
  • The Kappa statistic appears to ignore the ordered nature of the original scale, ie it treats the ratings as arbitrary categories rather than different levels on a scale. That is how I interpret the Stata output that looks individually at the agreement for each level 0, 1, 2, etc. The ICC, by contrast, goes to the other extreme and treats the rating as a continuous variable in a mixed-effects model. Of the two evils, I'd go with the one that at least acknowledges that 0 < 1 < 2 < 3 < 4 < 5, ie the ICC.
  • I gather there is such a thing as a weighted Kappa, which takes into account the ordinal nature of the data by incorporating the off-diagonals of an agreement-disagreement table (ie how far out each rating was). Without seeing your code and knowing more about how the data are coded in Stata, it appears you aren't using this option; certainly it doesn't seem to be signalled in the Stata output. (See the illustrative sketch after this list.)
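
To make the distinctions above concrete, here is a minimal sketch in Python (rather than Stata) using made-up ratings for two hypothetical raters. The data, the 0-5 scale here, and the package choices (scikit-learn for Kappa, pingouin for the ICC) are illustrative assumptions, not your analysis; the point is only that the unweighted Kappa, the weighted Kappa, and the ICC give different numbers for the same ratings.

```python
# Illustrative sketch with made-up data: unweighted kappa, weighted kappa,
# and the ICC are different quantities and will generally differ in value.
import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score
import pingouin as pg

rng = np.random.default_rng(42)

# Hypothetical data: two raters scoring 20 items on a 0-5 importance scale,
# with the second rater mostly within one point of the first.
n_items = 20
items = np.arange(n_items)
rater1 = rng.integers(0, 6, size=n_items)
rater2 = np.clip(rater1 + rng.integers(-1, 2, size=n_items), 0, 5)

# Unweighted kappa treats 0-5 as unordered categories: a disagreement of
# one point counts exactly the same as a disagreement of five points.
print("unweighted kappa:", cohen_kappa_score(rater1, rater2))

# Quadratically weighted kappa penalises disagreements by squared distance,
# so it respects the ordering 0 < 1 < ... < 5.
print("weighted kappa:  ", cohen_kappa_score(rater1, rater2, weights="quadratic"))

# The ICC treats the ratings as (roughly) continuous in a mixed-effects
# framework; pingouin reports several ICC variants from long-format data.
long = pd.DataFrame({
    "item": np.tile(items, 2),
    "rater": ["r1"] * n_items + ["r2"] * n_items,
    "score": np.concatenate([rater1, rater2]),
})
icc = pg.intraclass_corr(data=long, targets="item", raters="rater", ratings="score")
print(icc[["Type", "ICC"]])
```

With ratings that disagree only by a point or two, the quadratically weighted Kappa and the ICC will typically come out noticeably higher than the unweighted Kappa, which is exactly the "treats the scale as arbitrary categories" issue described above.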