Agreement Statistics – Methods for Comparing Inter-Rater Agreement Between Rater Classes

agreement-statistics

I'm interested in comparing ratings of the same objects that were done by raters from 2 different GROUPS/CLASSES (Experts, and Semi-Experts), so I can decide whether Semi-experts can replace experts in my setting. Though I thought I'd easily find a recommended method for doing this in the literature, after a long search, I'm not sure this is addressed.

The following figure illustrates the data structure:

Data structure: ratings from multiple identified raters from two rater-groups (classes)

(*See additional notes at bottom regarding the data structure)

To do the comparison, I believe I need to:

a. See to what extent experts' and semi-experts' ratings agree with each other (the easy part, but bear with me; there's a question there too); and

b. Check the inter-rater agreement within each group and determine whether the two groups are distinguishable from each other in that respect.

I've searched the literature. Doing (a) seems straightforward with Krippendorff's alpha. My dataset (which comes from real data, not a designed experiment) includes multiple ratings (0-3) per object from each rater group (experts, semi-experts).

I thought of averaging the ratings per object, per group, thus creating a dataset with 2 rows that emulate two "raters" (the prototypical expert and the prototypical semi-expert), and then using Krippendorff's alpha. Here's an illustration:

Condensed data: ratings from each group are averaged to show the "prototypical" rating from an Expert/Semi-Expert
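For concreteness, here is a minimal R sketch of that averaging idea on a made-up long-format table (the column names `object`, `rater`, `group`, `score` and the example values are purely illustrative assumptions, and the choice of `method = "interval"` reflects that averaged scores are no longer on the original 0-3 ordinal scale):

```r
library(irr)     # kripp.alpha()
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per (object, rater) rating
ratings <- tribble(
  ~object, ~rater, ~group,        ~score,
  "obj1",  "e01",  "expert",      2,
  "obj1",  "e02",  "expert",      3,
  "obj1",  "s01",  "semi-expert", 2,
  "obj2",  "e03",  "expert",      1,
  "obj2",  "s02",  "semi-expert", 0,
  "obj2",  "s03",  "semi-expert", 1,
  "obj3",  "e01",  "expert",      3,
  "obj3",  "s01",  "semi-expert", 3
)

# Average per object within each group -> two "prototypical" raters
prototypes <- ratings |>
  group_by(group, object) |>
  summarise(score = mean(score), .groups = "drop") |>
  pivot_wider(names_from = object, values_from = score)

# kripp.alpha() expects a raters-by-objects matrix
m <- as.matrix(prototypes[, -1])
kripp.alpha(m, method = "interval")
```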

It's not clear to me, however, that this averaging process is a good idea. I could not find a clear answer to that in Krippendorff's content-analysis book or papers.

(I chose Krippendorff's alpha because in some cases I have more than 2 raters, and it's more robust to missing data and other issues. But Krippendorff designed his alpha to check for consistency of agreement, not to compare raters from different groups, and the same seems to be true of other common measures. So I'm hesitant.)

As for (b), say I get K-alpha (experts) = x, K-alpha (semi-experts) = y. Is there a way to test the significance of the difference?

I've spent a good few days trying to find an answer, so this is not a simple case. Please bother answering only if you really know this stuff deeply. Many thanks for your consideration!

Notes about the data structure

I believe these two points should not make a difference for an analysis based on Krippendorff's alpha, but I bring them up here for completeness:

  1. Not all objects have ratings from all raters. Mostly, each object was rated by a couple of raters from each group.
  2. The number of raters in each group is different (30 in one, 60 in the other).

Best Answer

Here is an approach you could take. I'm going to first assume you need agreement, not consistency, but I'll show you how you can use consistency afterwards. For a great review of the difference, see this paper.

Agreement. This focuses on absolute agreement between raters: if I give it a 2, you will give it a 2. Here are the steps I would take:

1) Krippendorff's $\alpha$ across both groups. This gives you an overall benchmark.

2) Krippendorff's $\alpha$ for each group separately. Compare the two coefficients and see which group has the higher reliability. You can calculate confidence intervals for both and check whether they overlap; see Hayes and Krippendorff (2007). For an implementation in R, look at the irr package (kripp.alpha) and kripp.boot. The general approach is to use bootstrapping, though I haven't implemented it myself (a rough sketch follows this list). If the reliability of the semi-experts is statistically equivalent, or close enough for your purposes, you could proceed to consider using them. If it is substantially lower, you would need to justify the substitution by a) significantly lower costs of using semi-experts and b) identifying ways to improve their reliability.

3) Sufficient inter-rater reliability among the semi-experts is not enough, of course, if they are not in agreement with the experts. Here you could do a statistical comparison of the two groups' distributions and central tendency: if you are comfortable comparing means on ordinal data, you have sufficient observations, and the data look normal, use standard tests like a t-test or ANOVA. Otherwise a crosstab and $\chi^2$ test may be more appropriate (just keep in mind that these tests are sensitive to sample size). If there is a statistical and substantive difference between the groups, and their reliability differs substantively, then the semi-experts are unlikely to give you the same "quality" as the experts.
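Here is a rough, hand-rolled R sketch of steps 1)-3). The raters-by-objects matrices are simulated placeholders, the object-level bootstrap is only a simple approximation of the Hayes & Krippendorff (2007) procedure (which kripp.boot packages more carefully), and the crosstab at the end is just one way to do the step-3 comparison:

```r
library(irr)  # kripp.alpha()

# Hypothetical raters-by-objects matrices (NA = object not rated by that rater);
# 30 experts and 60 semi-experts, 50 objects, scores 0-3
set.seed(1)
n_obj <- 50
experts      <- matrix(sample(c(0:3, NA), 30 * n_obj, replace = TRUE), nrow = 30)
semi_experts <- matrix(sample(c(0:3, NA), 60 * n_obj, replace = TRUE), nrow = 60)

# Step 1: overall alpha across both groups
kripp.alpha(rbind(experts, semi_experts), method = "ordinal")$value

# Step 2: per-group alpha with a simple object-level bootstrap CI
boot_alpha <- function(ratings, B = 1000) {
  replicate(B, {
    idx <- sample(ncol(ratings), replace = TRUE)
    kripp.alpha(ratings[, idx], method = "ordinal")$value
  })
}
ci_experts <- quantile(boot_alpha(experts),      c(.025, .975), na.rm = TRUE)
ci_semi    <- quantile(boot_alpha(semi_experts), c(.025, .975), na.rm = TRUE)
ci_experts; ci_semi  # do the two intervals overlap?

# Step 3: compare the two groups' rating distributions with a crosstab
tab <- table(group = rep(c("expert", "semi"), c(length(experts), length(semi_experts))),
             score = c(experts, semi_experts))
chisq.test(tab)
```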

Consistency. This looks at whether the two groups are aligned, though not necessarily in exact agreement: if I rate an object highly, you will too, even if we don't give it the same rating. One common way to assess this is with the intra-class correlation coefficient; the classic reference is Shrout, P. E. and Fleiss, J. L. (1979), "Intraclass correlations: uses in assessing rater reliability," Psychological Bulletin, Vol. 86, No. 2, pp. 420–428. The psych package in R has functions for this. The approach basically relies on a nested ANOVA model: you could treat the raters as nested in the two groups and look at how much of the variance is attributed to the grouping relative to the overall variance. If you are familiar with ANOVA models it should be fairly straightforward to do (you may want to fit a mixed-effects model with lmer in the lme4 package and extract the variance components from there; that's how I've done it before). A rough sketch follows below.
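A minimal sketch of that variance-components idea, again on simulated placeholder data; the layout, column names, and the particular random-effects structure are my assumptions for illustration, not a prescription:

```r
library(lme4)  # lmer(), VarCorr()

# Hypothetical long-format data: one row per (object, rater) rating,
# with the first 5 raters labelled experts and the rest semi-experts
set.seed(1)
ratings <- expand.grid(object = paste0("obj", 1:40),
                       rater  = paste0("r", 1:15))
ratings$group <- ifelse(as.integer(sub("r", "", ratings$rater)) <= 5,
                        "expert", "semi-expert")
ratings$score <- sample(0:3, nrow(ratings), replace = TRUE)

# Raters nested in groups; how much variance sits at each level?
# (With only two groups, the group-level variance is estimated from just two
#  levels, so many analysts would enter group as a fixed effect instead.)
fit <- lmer(score ~ 1 + (1 | object) + (1 | group/rater), data = ratings)
as.data.frame(VarCorr(fit))

# For the classic Shrout & Fleiss ICCs within one group, psych::ICC() takes
# an objects-by-raters matrix, e.g. ICC(expert_wide).
```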
