Power analysis for inter-rater reliability study (Kappa) with multiple raters

Tags: agreement-statistics, cohens-kappa, statistical-power

I've spent some time looking through the literature on sample size calculation for Cohen's kappa and found several studies stating that increasing the number of raters reduces the number of subjects required to reach the same power. This seems logical when inter-rater reliability is assessed with kappa statistics. However, as far as I can see, none of them gives a specific calculation or reference to back up the statement. The only calculation I have found (at this link) is for 2 raters.

  • Is anyone familiar with a similar calculation for several raters?
  • Are there other factors that would affect the number of subjects required?

I will (probably) have 5 categories of nominal data. There might be combined findings. There will be 3 raters.

I found this article saying something about sample size and several raters:
Sim, J. and Wright, C. C. (2005) The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements, Journal of the American Physical Therapy Association, 85, pp. 257–268.

When seeking to optimize sample size, the investigator needs to choose
the appropriate balance between the number of raters examining each
subject and the number of subjects. In some instances, it is more
practical to increase the number of raters rather than increase the
number of subjects. However, according to Shoukri, when seeking to
detect a kappa of .40 or greater on a dichotomous variable, it is not
advantageous to use more than 3 raters per subject—it can be shown
that for a fixed number of observations, increasing the number of
raters beyond 3 has little effect on the power of hypothesis tests or
the width of confidence intervals. Therefore, increasing the number of
subjects is the more effective strategy for maximizing power.

Best Answer

You can use the R package kappaSize for sample size calculation or power analysis with 2 to 6 raters and 3 to 5 categories.

Three functions in the package are relevant, depending on the number of categories: Power3Cats(), Power4Cats() and Power5Cats(). With 5 nominal categories and 3 raters, as in the question, Power5Cats() is the one to use.
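As a rough sketch of how such a call might look for the design in the question (5 categories, 3 raters): the kappa values and category proportions below are placeholder assumptions, not values from the question, and should be replaced with estimates for your own study. The argument names follow the package documentation as I understand it, so check ?Power5Cats before relying on them.

```r
# Anticipated prevalence of each of the 5 categories (assumed values; must sum to 1).
props <- c(0.4, 0.2, 0.2, 0.1, 0.1)

if (requireNamespace("kappaSize", quietly = TRUE)) {
  # kappa0: kappa under the null hypothesis (assumed 0.40)
  # kappa1: kappa anticipated in the study  (assumed 0.60)
  result <- kappaSize::Power5Cats(
    kappa0 = 0.40,
    kappa1 = 0.60,
    props  = props,
    raters = 3,      # number of raters in the question
    alpha  = 0.05,
    power  = 0.80
  )
  print(result)      # reports the required number of subjects
} else {
  message("Package not installed; run install.packages(\"kappaSize\") first.")
}
```

Rerunning the same call with raters set to each value from 2 to 6 should show directly how much the required number of subjects changes with the number of raters.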
