Solved – What to do in case of low inter-rater reliability (ICC)

agreement-statistics, reliability

Background: Eight doctors each rated the same 54 patients on a persuasiveness measure (1-7 Likert scale). The mean score on the persuasiveness measure will eventually be the outcome measure of my experiment.

Inter-rater reliability was quantified as the intraclass correlation coefficient (ICC), using the two-way random effects model with consistency. Unfortunately, the inter-rater reliability of the eight doctors was low (ICC = .350, single measures). Should I still run further planned analyses with these unreliable data? Or can it possibly be justified that I only include the doctors (i.e., raters) with the highest inter-rater reliability? I found out there are two doctors with a more acceptable inter-rater reliability (ICC = .718, N = 2), but I don't think this is enough reason to exclude the other doctors from analyses. I would really appreciate any references to literature that deals with this problem.
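For reference, this is essentially how the single-measures consistency ICC is obtained from a patients × raters matrix via the two-way ANOVA decomposition (McGraw & Wong, 1996). The sketch below uses random example data, not my actual ratings:

```python
import numpy as np

# Illustrative only: a random 54 x 8 matrix standing in for the real ratings
# (rows = patients, columns = doctors), values on the 1-7 scale.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(54, 8)).astype(float)

def icc_consistency_single(x):
    """ICC(C,1): two-way model, consistency definition, single measures,
    computed from a subjects x raters matrix."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-patient means
    col_means = x.mean(axis=0)   # per-doctor means

    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-patient
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-doctor
    ss_error = ss_total - ss_rows - ss_cols          # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

print(round(icc_consistency_single(ratings), 3))
```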

Best Answer

I'd rather answer on the grounds of the methodology itself than on how to "fix" the situation. In another context, I assisted in working on a ratings and classification system and found that inter-rater agreement was disappointingly low. Two paths were considered:

  1. Change how rating agreements were defined and identify those who seemed to "understand" the task, or
  2. Refine the definitions used, along with the guidance and examples provided to raters, so that they could more easily understand how to rate things.

In the first scenario, the whole methodology and its results could have been rendered a waste simply because the inter-rater reliability was low: it indicated that either the original definitions were poor or the raters were given poor instructions. Had I proceeded along that path, I was sure to run into problems.

In the second case, the agreement between raters was very good. Since they rated quite a lot of items, they could also give feedback when they thought the original definitions and guidance were inadequate. In the end, the methodology was very reproducible.

Based on that, I would not modify your set of raters yet, but return to the original definitions and guidance. Any tinkering after the ratings have been collected is a problem, though it can be useful as a quality check. There will sometimes be raters who do what they want, no matter the guidance given. With good statistical methods it is easy to identify them and weight their contributions appropriately.
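As an illustration of the kind of check I mean (a sketch that assumes the ratings sit in a patients × raters matrix, with made-up data), one simple diagnostic is to correlate each rater with the mean of the remaining raters; raters who track the consensus poorly stand out:

```python
import numpy as np

# Assumed layout: rows = patients, columns = raters; example data only.
rng = np.random.default_rng(1)
ratings = rng.integers(1, 8, size=(54, 8)).astype(float)

# Correlate each rater's scores with the mean of the other raters.
for j in range(ratings.shape[1]):
    others = np.delete(ratings, j, axis=1).mean(axis=1)
    r = np.corrcoef(ratings[:, j], others)[0, 1]
    print(f"rater {j + 1}: r with consensus of the others = {r:.2f}")
```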

Now, if I'm mistaken and you don't plan any further data collection, i.e. your data are already collected and the study is done, you could try PCA or something like it to get a sense of how the different doctors (or patients) cluster.
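A minimal sketch of that idea (again assuming a patients × doctors matrix; scikit-learn is just one convenient choice, and the data here are random placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

# Example data only: rows = patients, columns = doctors.
rng = np.random.default_rng(2)
ratings = rng.integers(1, 8, size=(54, 8)).astype(float)

# Treat each doctor as a variable: centre and scale columns, then project.
z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0)
pca = PCA(n_components=2).fit(z)

# Loadings show how each doctor relates to the first two components;
# doctors with similar loadings are rating in a similar way.
# (pca.transform(z) gives the patient scores if you want to look at
# clustering on that side instead.)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
for j, (l1, l2) in enumerate(loadings, start=1):
    print(f"doctor {j}: PC1 = {l1:+.2f}, PC2 = {l2:+.2f}")
```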

Were the patients exposed to all of the doctors at the same time (e.g. through a video recording), or sequentially, with a chance to modify their presentation at each interaction? If the latter, the issue could lie with the patients rather than the doctors.