What is the best inter-rater agreement test for Likert-scale-type questions? As far as I can see, Cronbach's $\alpha$ measures internal consistency, i.e. how well a set of items hangs together in describing the main construct. I want to measure inter-rater agreement. Is Cronbach's $\alpha$ the right metric for doing this?
Solved – Inter-rater agreement for Likert scale
agreement-statistics, cronbachs-alpha, intraclass-correlation, likert
Related Solutions
- Whether you should compute agreement for each item depends somewhat on how you plan to analyse the data.
- If you plan to compute scale scores (e.g., sum the binary responses or sum the Likert responses) to form a scale, then you could perform a reliability analysis on the scale scores. In this situation, you may have enough scale points to use procedures for inter-rater reliability assessment that assume numeric data, such as the intraclass correlation coefficient (ICC). Your overall evaluation of reliability would then focus on the scale score. Reliability analysis of individual items might then just be used as a means of deciding which items to include in the composite scale (e.g., you could drop items with particularly low agreement).
- If you plan to report individual items, then you would want to report kappa for each item. You may still find it useful to summarise these individual kappas in order to quickly communicate the general reliability of the items (e.g., report the range, mean, and SD of kappa across items); a minimal R sketch of both strategies follows this list.
- If you don't like the kappa values that you are getting, this is not a reason not to use kappa (apologies for the triple negative).
- It may be that your rules of thumb for interpreting kappa are inappropriate.
- Alternatively, it may be that the items are just not that reliable (when the category distribution is skewed, the percentage of agreement can be high even though the raters disagree on which cases fall into the minority category, so kappa will be low). In general, individual items are going to be less reliable than composite scales. Also, some binary evaluations are quite clear (e.g., gender), but where a judge is asked whether an object passes some threshold, ratings might be more reliable if judges were asked to rate on a continuum.
- You can use an ordinal (weighted) kappa on Likert items. @chl has an excellent discussion of the issues and alternatives here.
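For concreteness, here is a minimal R sketch of the two strategies above using the `irr` package. The data frame `dat` and its column names are hypothetical: two raters scoring several 5-point Likert items, stored as `item1_r1`, `item1_r2`, and so on.

```r
# A minimal sketch, not a definitive recipe. Assumes a hypothetical data frame
# 'dat' with two raters' scores on 5-point Likert items, in columns named
# item1_r1, item1_r2, item2_r1, item2_r2, ...
library(irr)

items <- c("item1", "item2", "item3")

# Weighted (ordinal) kappa for each item, two raters per item
ks <- sapply(items, function(it) {
  kappa2(dat[, paste0(it, c("_r1", "_r2"))], weight = "squared")$value
})
c(mean = mean(ks), sd = sd(ks), min = min(ks), max = max(ks))  # summary across items

# ICC on composite scale scores (sum of the items for each rater)
scores <- cbind(
  r1 = rowSums(dat[, paste0(items, "_r1")]),
  r2 = rowSums(dat[, paste0(items, "_r2")])
)
icc(scores, model = "twoway", type = "agreement", unit = "single")
```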
I'd rather answer on the grounds of the methodology itself than on how to "fix" the situation. In another context, I assisted in working on a ratings and classification system, and found that inter-rater agreement was disappointingly low. Two paths were considered:
- Change how rating agreements were defined and identify those who seemed to "understand" the task, or
- Refine the definitions used, along with the guidance and examples provided to raters, so that they could more easily understand how to rate things.
In the first scenario, the whole methodology and its results could be rendered a waste simply because the inter-rater reliability was low. It indicated that either the original definitions were bad or that the raters were given poor instructions. If I had proceeded along that path, I would surely have had problems.
In the second case, the agreement between raters was very good. Since they rated quite a lot of items, they could also give feedback on where they thought the original definitions and guidance were inadequate. In the end, the methodology was very reproducible.
Based on that, I would not yet modify your set of raters, but rather return to the original definitions and guidance. Any tinkering after the ratings have been made is a problem, though it can be useful as a quality check. There will sometimes be raters who do what they want, no matter what guidance they are given. With good statistical methods, it is easy to identify them and weight their contributions appropriately.
Now, if I'm mistaken and you don't plan to collect further data, i.e. your data are already collected and done, then you could run PCA or something like it to get a sense of how the different doctors (or patients) cluster; a rough sketch follows below.
Were the patients exposed to all of the doctors at the same time (e.g. through a video recording) or were they exposed sequentially, and had a chance to modify their presentation with each interaction? If the latter, then there could be issues with the patients, and not the doctors.
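A rough sketch of that clustering idea in R, assuming a hypothetical matrix `ratings` with one row per doctor (rater) and one column per rated case:

```r
# A rough sketch only. Assumes a hypothetical matrix 'ratings' with one row
# per doctor (rater) and one column per rated case.
pca <- prcomp(ratings, scale. = TRUE)

# Do the raters form visible groups in the first two principal components?
plot(pca$x[, 1:2], pch = 19)
text(pca$x[, 1:2], labels = rownames(ratings), pos = 3)

# Alternatively, cluster raters directly on their rating profiles
plot(hclust(dist(ratings)))
```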
Best Answer
Krippendorff's alpha, originally developed in the field of content analysis, is well suited to ordinal ratings such as Likert-scale ratings. It has several advantages over other measures such as Cohen's kappa, Fleiss' kappa, and Cronbach's alpha: it can handle more than two raters, it is robust to missing data, and it can handle different types of scales (nominal, ordinal, etc.).
It also accounts for chance agreement better than some other measures, such as Cohen's kappa.
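As a minimal sketch of what that looks like in practice, here is an R example with the `irr` package; the numbers are invented (three raters scoring ten subjects on a 5-point Likert item, with missing ratings coded as `NA`):

```r
# A minimal sketch using the irr package; the data are invented.
# Rows are raters, columns are subjects; scores are on a 5-point Likert item,
# with NA marking missing ratings.
library(irr)

ratings <- rbind(
  rater1 = c(1, 2, 3, 3, 2, 1, 4, 1, 2, NA),
  rater2 = c(1, 2, 3, 3, 2, 2, 4, 1, 2, 5),
  rater3 = c(NA, 3, 3, 3, 2, 3, 4, 2, 2, 5)
)

kripp.alpha(ratings, method = "ordinal")
```

For interpretation, Krippendorff himself suggests treating $\alpha \geq 0.8$ as acceptable and $\alpha \geq 0.667$ as the lowest limit for drawing tentative conclusions.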
Calculation of Krippendorff's alpha is supported by several statistical software packages, including R (via the `irr` package), SPSS, and others. Below are some relevant papers that discuss Krippendorff's alpha, including its properties and implementation, and compare it with other measures:
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77-89.
Krippendorff, K. (2004). Reliability in Content Analysis: Some Common Misconceptions and Recommendations. Human Communication Research, 30(3), 411-433. doi: 10.1111/j.1468-2958.2004.tb00738.x
Chapter 3 in Krippendorff, K. (2013). Content Analysis: An Introduction to Its Methodology (3rd ed.). Sage.
There are some additional technical papers on Krippendorff's website.