Solved – The best inter-rater reliability measure for a questionnaire with multiple raters

agreement-statistics, cohens-kappa, intraclass-correlation, reliability

I have one questionnaire with 12 questions. Each question has approximately 5 rating categories to choose from (categorical but not Likert). There will be 20 raters who will each complete 3 questionnaires based on 3 different scenarios. What is the best measure of inter-rater reliability for this methodology, kappa or ICC? And what is the best way to structure the data for use with either SPSS or Stata (i.e., use numeric codes for the rating categories, put coders in rows or columns, etc.)? Thanks!

Best Answer

If I am understanding correctly, you have 20 raters who each complete a 12-item questionnaire on 3 different occasions; thus, each rater completes a total of 36 items. On each item, raters choose a single category from 5 possible categories, and these categories exhibit an ordinal (i.e., ranked) relationship. Your goal is to determine the amount of agreement between the raters on each item, accounting for the fact that the categories are ordinal and that there are many raters.

This problem can be solved using a generalized agreement index (such as the generalized kappa coefficient, the generalized pi coefficient, or the generalized S score), the rWG index, or (depending on the level of observed between-rater variance) an intraclass correlation coefficient. You can read more about these options in the references listed below. Of them, I would recommend the rWG index if you think your categories are roughly interval (i.e., evenly spaced) or, if not, the generalized S score with ordinal weights. You should prepare your data as a $36 \times 20$ matrix where each row corresponds to an item and each column corresponds to a rater. In each cell (i.e., row $i$, column $j$), put the numeric code for the category that rater $j$ assigned to item $i$.
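For concreteness, here is a minimal sketch of that layout in Python (pandas), with random placeholder data standing in for your real ratings; the file name and row/column labels are just illustrative, and the same rectangular layout imports cleanly into SPSS or Stata.

```python
# A minimal sketch of the data layout described above: a 36 x 20 matrix with
# one row per item (12 questions x 3 scenarios) and one column per rater.
# The file name and labels are illustrative, not prescribed by the answer.
import numpy as np
import pandas as pd

n_items, n_raters, n_categories = 36, 20, 5

# Placeholder data: in practice, replace this with the observed ratings,
# coded 1-5 for the five categories.
rng = np.random.default_rng(0)
ratings = pd.DataFrame(
    rng.integers(1, n_categories + 1, size=(n_items, n_raters)),
    index=[f"item_{i + 1}" for i in range(n_items)],
    columns=[f"rater_{j + 1}" for j in range(n_raters)],
)

# ratings.iloc[i, j] is the numeric code rater j+1 assigned to item i+1.
ratings.to_csv("ratings_matrix.csv")  # easy to import into SPSS or Stata
```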

You can calculate the generalized S score "by hand" using the formula provided here, or you can click here to get functions that calculate it in SAS, R, MATLAB, or Excel (this one costs money). Note that the S score is also called the "Brennan-Prediger" or "BP" coefficient in some functions. I would also be willing to calculate the S score for you if you posted or sent me the matrix I described above.
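If you want to see the logic behind the "by hand" calculation, here is a rough Python sketch of the generalized S coefficient with linear ordinal weights, computed as the average pairwise weighted agreement across raters and items. Treat it as an illustration of the idea and check it against the formulas in Gwet (2014) rather than as a substitute for the functions linked above; all names are my own.

```python
# Sketch: generalized S (Brennan-Prediger) coefficient with linear ordinal
# weights, computed from an items x raters matrix of category codes.
from itertools import combinations

import numpy as np


def generalized_s(ratings: np.ndarray, n_categories: int) -> float:
    """ratings: items x raters array of category codes 1..n_categories."""
    q = n_categories

    # Linear (ordinal) weights: full credit for exact matches, partial
    # credit that decays with the distance between categories.
    cats = np.arange(1, q + 1)
    weights = 1.0 - np.abs(cats[:, None] - cats[None, :]) / (q - 1)

    # Observed agreement: average weight over all rater pairs and all items.
    n_items, n_raters = ratings.shape
    pair_weights = [
        weights[ratings[i, a] - 1, ratings[i, b] - 1]
        for i in range(n_items)
        for a, b in combinations(range(n_raters), 2)
    ]
    p_a = float(np.mean(pair_weights))

    # Chance agreement for the S / Brennan-Prediger family assumes raters
    # pick categories uniformly at random, so it depends only on the
    # weight matrix and the number of categories.
    p_e = float(weights.sum()) / q**2

    return (p_a - p_e) / (1 - p_e)


# Example with the matrix from the previous sketch (or any items x raters array):
# s_ordinal = generalized_s(ratings.to_numpy(), n_categories=5)
```

Setting the weight matrix to the identity instead of linear weights would give you the unweighted (nominal) version of the same coefficient.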

A few concluding thoughts... You didn't explicitly mention wanting to compare reliability between the different measurement occasions (i.e., scenarios), but you could do so by calculating reliability separately for three sub-tables corresponding to those scenarios (e.g., rows 1–12, 13–24, and 25–36), as sketched below; you could also compute confidence intervals for each and then compare them. You also mentioned in your comment that you hypothesized that raters would choose the exact same answer. If you truly want them to choose the exact same answer, then your categories should be treated as nominal rather than ordinal; with ordinal categories, raters get partial credit for getting close. Finally, note that the goal of inter-rater reliability is not to test the statistical hypothesis that reliability is significantly different from zero; rather, think of it as quantifying the extent to which the raters agreed.
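As a quick follow-on to the per-scenario suggestion, the snippet below reuses the hypothetical `ratings` matrix and the `generalized_s` helper from the sketches above and assumes the rows are ordered scenario by scenario; confidence intervals would require a variance estimator or a bootstrap on top of this, which isn't shown.

```python
# Compute the coefficient separately for each scenario block of 12 items.
scenario_scores = {
    f"scenario_{s + 1}": generalized_s(
        ratings.to_numpy()[s * 12:(s + 1) * 12, :], n_categories=5
    )
    for s in range(3)
}
print(scenario_scores)
```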

References

Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics.

James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69(1), 85–98.

McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.