Solved – Standard Error of Measurement (SEM) in Inter-rater reliability

agreement-statistics, reliability

I am currently writing my thesis on the inter-rater reliability of a diagnostic tool. I want to use the standard error of measurement (SEM), computed with the formula:

For rater 1 and rater 2, $\mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - \mathrm{ICC}}$, where SD is the standard deviation and ICC is the reliability estimated for rater 1 and rater 2.

However, I cannot find which SD to use. Do I use the pooled SD of the two raters?

Best Answer

As Jeremy pointed out, there are multiple versions of the ICC, which reflect distinct ways of accounting for rater or item variance in the overall variance. There's a nice summary of the use of Kappa and ICC indices for rater reliability in Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial, by Kevin A. Hallgren, and I discussed the different versions of the ICC in a related post.

Briefly, you have to decide whether your raters are considered as sampled from a larger pool of potential raters or as a fixed set of raters. In the first case, this means using a random effects model, while in the second the raters will be treated as fixed effects. Likewise, items may be treated as either fixed or random units. Usually, we use a two-way random effects model (both raters and items are treated as random effects) to estimate relative agreement (or the ICC(2,k) version for absolute reliability, if you care about systematic error between raters).

The SEM can be calculated as the square root of the mean square error from a one-way ANOVA, or from the sample standard deviation, as suggested in the other reply.
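To make the two routes to the SEM concrete, here is a minimal Python sketch (standard library only). It computes the ANOVA mean squares for a subjects-by-raters table, the single-rater ICC forms from the usual Shrout and Fleiss formulas, and then the SEM both ways. The ratings are the small 6-subject, 4-judge example from Shrout and Fleiss (1979), used purely for illustration; the function names and the choice of "SD of all ratings" as the SD are assumptions of this sketch, not something prescribed above.

```python
import math

def anova_mean_squares(x):
    """Mean squares for a table with n subjects (rows) and k raters (columns)."""
    n, k = len(x), len(x[0])
    grand = sum(map(sum, x)) / (n * k)
    row_means = [sum(row) / k for row in x]
    col_means = [sum(row[j] for row in x) / n for j in range(k)]
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_subj = k * sum((m - grand) ** 2 for m in row_means)
    ss_rater = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_subj - ss_rater
    return {
        "n": n,
        "k": k,
        "BMS": ss_subj / (n - 1),                     # between subjects
        "JMS": ss_rater / (k - 1),                    # between raters
        "EMS": ss_err / ((n - 1) * (k - 1)),          # two-way residual
        "WMS": (ss_rater + ss_err) / (n * (k - 1)),   # one-way within subjects
        "sd_all": math.sqrt(ss_total / (n * k - 1)),  # SD of all ratings (assumed choice)
    }

def single_rater_iccs(ms):
    """Single-rater ICC forms: one-way, two-way random, two-way mixed."""
    n, k = ms["n"], ms["k"]
    b, j, e, w = ms["BMS"], ms["JMS"], ms["EMS"], ms["WMS"]
    return {
        "ICC(1,1)": (b - w) / (b + (k - 1) * w),
        "ICC(2,1)": (b - e) / (b + (k - 1) * e + k * (j - e) / n),
        "ICC(3,1)": (b - e) / (b + (k - 1) * e),
    }

# Shrout & Fleiss (1979) example: 6 subjects rated by 4 judges.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

ms = anova_mean_squares(ratings)
iccs = single_rater_iccs(ms)  # ICC(1,1) ~ 0.17, ICC(2,1) ~ 0.29, ICC(3,1) ~ 0.71

# Route 1: square root of the one-way ANOVA within-subjects mean square.
sem_anova = math.sqrt(ms["WMS"])

# Route 2: SD * sqrt(1 - ICC), once per ICC form.
sem_by_icc = {name: ms["sd_all"] * math.sqrt(1 - v) for name, v in iccs.items()}

print(f"SEM from one-way ANOVA: {sem_anova:.2f}")
for name, value in sem_by_icc.items():
    print(f"SEM using {name} = {iccs[name]:.2f}: {value:.2f}")
```

The two routes need not give identical numbers, which is one reason to report which SD and which ICC form went into the SEM.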

Note that the choice of SD matters less than the choice of ICC, since ICC estimates can differ a lot depending on your sample size and the inherent variation of the ratings. Here are two examples of results obtained from the same dataset using the ICC2 (top) and ICC3 (bottom) approaches:

[Image: output for the ICC2 approach]

[Image: output for the ICC3 approach]
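As a rough illustration of why the ICC choice tends to dominate, the following Python sketch computes the SEM for three candidate SDs (each rater's own SD and the pooled SD of the two raters) under two ICC values. Everything here is hypothetical: the scores are made up, and the two ICC values are placeholders standing in for estimates from two different ICC forms, not results from any real dataset.

```python
import math
import statistics

def pooled_sd(a, b):
    """Pooled SD of two equally sized samples: root of the mean of the two variances."""
    return math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)

def sem(sd, icc):
    """Standard error of measurement, SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1.0 - icc)

# Hypothetical scores from two raters on the same eight subjects.
rater1 = [12, 15, 11, 18, 14, 16, 13, 17]
rater2 = [13, 17, 12, 20, 15, 19, 13, 18]

sd_options = {
    "rater 1": statistics.stdev(rater1),
    "rater 2": statistics.stdev(rater2),
    "pooled": pooled_sd(rater1, rater2),
}

# Assumed ICC values, standing in for two different ICC forms.
icc_options = {"ICC = 0.60": 0.60, "ICC = 0.85": 0.85}

sem_table = {
    icc_label: {sd_label: sem(sd, icc) for sd_label, sd in sd_options.items()}
    for icc_label, icc in icc_options.items()
}

for icc_label, row in sem_table.items():
    for sd_label, value in row.items():
        print(f"{icc_label} | SD from {sd_label:<8} | SEM = {value:.2f}")
```

With these made-up inputs, switching between SD definitions moves the SEM less than switching between the two ICC values does. How large the gap is in practice depends on the data.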

In R, several packages are available, including psych (see its ICC function) and irr. I give some examples using the former in my other answer, and some examples using irr in a separate blog post on my site. In Stata, the icc command provides all three ANOVA models and the associated ICC estimates, as shown above.