I have one questionnaire with 12 questions. Each question has approximately 5 rating scales to choose from (categorical but not Likert). There will be 20 raters who will each complete 3 questionnaires based on 3 different scenarios. What is the best measure of inter-rater reliability for this methodology, Kappa or ICC? And what is the best way to structure the data for use with either SPSS or Stata (i.e., use numeric codes for the rating scales, put coders in rows or columns, etc.)? Thanks!
Solved – the best Inter-rater reliability measure for a questionnaire with multiple raters
agreement-statistics, cohens-kappa, intraclass-correlation, reliability
Related Solutions
These are distinct ways of apportioning rater and item variance within the overall variance, following Shrout and Fleiss (1979) (Cases 1 to 3 in their Table 1); a worked sketch follows the list:
- One-way random effects model: raters are considered as sampled from a larger pool of potential raters, hence they are treated as random effects; the ICC is then interpreted as the proportion of total variance accounted for by between-subject (item) variance (ICC(1) in the Shrout and Fleiss notation).
- Two-way random effects model: both factors, raters and items/subjects, are viewed as random effects, so we have two variance components (or mean squares) in addition to the residual variance; we further assume that every rater assesses every item/subject; the ICC is then the proportion of the total variance (subjects + raters + residual) attributable to subjects/items.
- Two-way mixed model: contrary to the one-way approach, here raters are treated as fixed effects (no generalization beyond the sample at hand) while items/subjects remain random effects; the unit of analysis may be a single rating or the average of the ratings.
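If you want to check these variants outside SPSS, here is a minimal sketch in Python; it assumes the pingouin package and uses simulated ratings with illustrative column names, laid out with items in rows and raters in columns:

```python
# Minimal sketch: Shrout & Fleiss ICC variants via pingouin (pip install pingouin).
# The data below are simulated and the column names are illustrative.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)

# Wide layout: one row per item/subject, one column per rater.
wide = pd.DataFrame(
    rng.integers(1, 6, size=(36, 20)),                 # ratings on a 1-5 scale
    columns=[f"rater{j + 1}" for j in range(20)],
)
wide.insert(0, "item", np.arange(1, 37))

# pingouin expects long format: one row per (item, rater) pair.
long = wide.melt(id_vars="item", var_name="rater", value_name="score")

icc = pg.intraclass_corr(data=long, targets="item", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
# ICC1 = one-way random, ICC2 = two-way random, ICC3 = two-way mixed;
# the "k" variants (ICC1k, ICC2k, ICC3k) treat the average of the k raters as the unit.
```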
I would say raters have to be entered as columns, although I'm not an SPSS specialist. Dave Garson's dedicated website is worth a look for those working with SPSS. There is also a complete online tutorial on reliability analysis (Robert A. Yaffee) [archived version].
For theoretical considerations about the mixed-effects approach, please consider reading my answer to this related question: Reliability in Elicitation Exercise.
I'd rather answer on the grounds of the methodology itself than on how to "fix" the situation. In another context, I assisted in working on a ratings and classification system and found that inter-rater agreement was disappointingly low. Two paths were considered:
- Change how rating agreements were defined and identify those who seemed to "understand" the task, or
- Refine the definitions used, along with the guidance and examples provided to raters, so that they could more easily understand how to rate things.
In the first scenario, the whole methodology and its results could be rendered worthless simply because the inter-rater reliability was low. It indicated that either the original definitions were bad or the raters were given poor instructions. Had I proceeded along that path, I was sure to run into problems.
In the second case, the agreement between raters was very good. Since they rated quite a lot of items, they could also give feedback on where they thought the original definitions and guidance were inadequate. In the end, the methodology was very reproducible.
Based on that, I would not modify your set of raters yet, but return to the original definitions and guidance. Any tinkering after the ratings are in is a problem, though it can be useful as a quality check. There will sometimes be raters who do what they want, no matter the guidance given. With good statistical methods it is easy to identify them and weight their contributions appropriately.
Now, if I'm mistaken and you don't plan any further collection, i.e., your data are already collected and done, what you may do is run a PCA or something like it and see whether you can get a sense of how the different doctors (or patients) cluster; a sketch follows.
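As a rough illustration (not part of the original analysis; the data and names below are made up), a principal-component plot of the rater-by-item matrix is one way to see which raters group together:

```python
# Hedged sketch: PCA on a rater-by-item matrix to see how raters cluster.
# The ratings are simulated; with real data, rows would be your raters.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
ratings = rng.integers(1, 6, size=(20, 36)).astype(float)   # rows = raters, cols = items

# Center each item so the components reflect disagreement between raters,
# not differences in the items' mean ratings.
centered = ratings - ratings.mean(axis=0)

pca = PCA(n_components=2)
scores = pca.fit_transform(centered)        # one (PC1, PC2) point per rater
print(pca.explained_variance_ratio_)
# Raters whose points sit far from the main cluster are candidates for review.
```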
Were the patients exposed to all of the doctors at the same time (e.g. through a video recording) or were they exposed sequentially, and had a chance to modify their presentation with each interaction? If the latter, then there could be issues with the patients, and not the doctors.
Best Answer
If I am understanding correctly, you have 20 raters who each complete a 12-item questionnaire on 3 different occasions; thus, each rater completes a total of 36 items. On each item, raters must choose a single category from 5 possible categories, and these categories exhibit an ordinal (i.e., ranked) relationship. Your goal is to determine the amount of agreement between the raters on each item, accounting for the fact that the categories are ordinal and that there are many raters.
This problem can be solved using a generalized agreement index (such as the generalized kappa coefficient, the generalized pi coefficient, or the generalized S score), the RWG index, or (depending on the level of observed between-rater variance) an intraclass correlation coefficient. You can read more about these options in the citations listed below; of them, I would recommend the RWG index if you think your categories are roughly interval (i.e., evenly spaced) or, if not, then the generalized S score with ordinal weights. You should prepare your data in a $36\times20$ matrix where each row corresponds to an item and each column corresponds to a rater. Within each cell (e.g., row $i$ and column $j$), put a numerical code corresponding to the category that rater $j$ assigned to item $i$.
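A minimal sketch of that layout, assuming Python/pandas (the names and numeric codes below are illustrative, not real data):

```python
# Sketch of the 36 x 20 layout described above: 12 questions x 3 scenarios in rows,
# 20 raters in columns, numeric category codes 1-5 in the cells. Simulated data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_questions, n_scenarios, n_raters, n_categories = 12, 3, 20, 5

rows = pd.MultiIndex.from_product(
    [range(1, n_scenarios + 1), range(1, n_questions + 1)],
    names=["scenario", "question"],
)
ratings = pd.DataFrame(
    rng.integers(1, n_categories + 1, size=(n_questions * n_scenarios, n_raters)),
    index=rows,
    columns=[f"rater{j + 1}" for j in range(n_raters)],
)
print(ratings.shape)   # (36, 20)
```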
You can calculate the generalized S score "by hand" using the formula provided here or you can click here to get functions to calculate it in SAS, R, MATLAB, or Excel (this one costs money). Note that the S score is also called the "Brennan Prediger" or "BP" index in some functions. I would also be willing to calculate the S score for you if you posted/sent me the matrix I described above.
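If it helps, here is a hedged Python sketch of the unweighted multi-rater S (Brennan-Prediger) coefficient, in which chance agreement is fixed at 1/q for q categories; the ordinal-weighted variant recommended above adds a weight matrix on top of this (see Gwet, 2014, for those formulas):

```python
# Unweighted generalized S (Brennan-Prediger) coefficient for multiple raters.
# p_a = average proportion of agreeing rater pairs per item; p_e = 1/q.
import numpy as np

def s_score(ratings: np.ndarray, n_categories: int) -> float:
    """ratings: items x raters array of integer category codes 1..q."""
    q = n_categories
    n_raters = ratings.shape[1]
    # For each item, count how many raters chose each category.
    counts = np.stack([(ratings == k).sum(axis=1) for k in range(1, q + 1)], axis=1)
    # Proportion of agreeing pairs among the n_raters * (n_raters - 1) ordered pairs.
    pairs_agree = (counts * (counts - 1)).sum(axis=1) / (n_raters * (n_raters - 1))
    p_a = pairs_agree.mean()
    p_e = 1.0 / q
    return (p_a - p_e) / (1 - p_e)

# Example, using the simulated 36 x 20 matrix from the earlier sketch:
# print(s_score(ratings.to_numpy(), n_categories=5))
```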
A few concluding thoughts... You didn't explicitly mention wanting to compare reliability between the different measurement occasions (i.e., scenarios), but you could do so by calculating reliability for three sub-tables, each corresponding to one scenario (e.g., rows 1-12, 13-24, and 25-36), computing confidence intervals for each, and then comparing them. You also mentioned in your comment that you hypothesized raters would choose the exact same answer. If you truly want them to choose the exact same answer, then your categories should be treated as nominal rather than ordinal; with ordinal categories, raters get partial credit for being close. Also, note that the goal of inter-rater reliability analysis is not to test the statistical hypothesis that reliability differs significantly from zero; rather, think of it as quantifying the extent to which the raters agreed.
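One simple way to do that per-scenario split and obtain rough confidence intervals is an item-level bootstrap; this is only one option (analytic variance estimators are also available, see Gwet, 2014), and the sketch below reuses the s_score function from the previous block:

```python
# Per-scenario S scores with a simple item-level bootstrap CI (simulated data).
import numpy as np

def bootstrap_ci(ratings: np.ndarray, n_categories: int, n_boot: int = 2000, seed: int = 0):
    """95% percentile interval for the S score, resampling items with replacement."""
    rng = np.random.default_rng(seed)
    n_items = ratings.shape[0]
    stats = [
        s_score(ratings[rng.integers(0, n_items, n_items)], n_categories)
        for _ in range(n_boot)
    ]
    return np.percentile(stats, [2.5, 97.5])

# Rows 0-11, 12-23, and 24-35 correspond to scenarios 1, 2, and 3 in the layout above.
# mat = ratings.to_numpy()
# for s, block in enumerate(np.split(mat, 3), start=1):
#     print(s, round(s_score(block, 5), 3), bootstrap_ci(block, 5))
```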
References
Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics.
James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69(1), 85–98.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.