I've read some of the answers here (specifically the 'Inter-rater reliability for ordinal or interval data' one). I'm still somewhat perplexed though!
I have data for 4 raters, who each rated every subject's CT scan, on two occasions, for the presence or absence of 9 signs. The two readings were separated by a washout period. As such, my data looks like this (for each sign/variable):
A1 A2 B1 B2 C1 C2 D1 D2
0 1 1 1 0 1 0 0
1 1 0 0 1 1 1 1
etc
Here A1 is the first reading by rater A, A2 the second, and so on. Calculating the intra-rater reliability is easy enough. For inter-rater reliability, I computed Fleiss' kappa and used bootstrapping to estimate the CIs, which I think is fine. The catch is that this treats the two readings by a given rater as if they came from two different raters. I thought about using the ICC via the WinPepi function that provides inter-rater agreement for fixed raters, but I suspect it might not be appropriate for binary data?
Would you just get the Fleiss and report it as such?
PS: The answer provided below by StasK is brilliant, and did clarify some things, but because of an ambiguity in the original question it isn't quite what I was looking for…
Best Answer
I assume that A through D are different symptoms, say, and 1 and 2 are the two raters. Since you tagged this question with Stata, I will build a Stata example. Let us first simulate some data: we have a bunch of subjects with two uncorrelated traits, and a battery of questions tapping into these traits. The two raters have different sensitivities to each of the traits: the first rater is a tad more likely than the second rater to give a positive answer on question A, but slightly less likely to give a positive answer on question B, etc.
This should produce something like
which I hope resembles your data, at least in terms of the existing variables.
A fully non-parametric summary of the inter-rater agreement can be constructed by converting the binary representation into a decimal representation. The outcome a1=0, b1=0, c1=0, d1=0 is 0000b = 0; the outcome in the first observation is 1011b = 11, etc. Let us produce this encoding:
This should produce something like
Now, these patterns are perfectly comparable using `kap`:
You can play with the sample size or with the differences between raters to produce a non-significant answer :). This kappa suffers from a serious drawback: it gives no credit for partial agreement across items. The patterns 0001 and 0000, even though they match on 3 of 4 items (75%), are counted as non-matches under this approach. So it is an extremely conservative measure of the inter-rater agreement.
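The pattern encoding and the all-or-nothing behaviour of the pattern-level kappa can be illustrated with a minimal sketch. It is written in plain Python rather than Stata, the ratings are made up for illustration (they are not the simulated data from this answer), and `encode_pattern` and `cohen_kappa` are hypothetical helper names, not Stata commands.

```python
from collections import Counter

def encode_pattern(bits):
    """Read a sequence of 0/1 item ratings as a binary number, e.g. (1, 0, 1, 1) -> 11."""
    value = 0
    for b in bits:
        value = value * 2 + int(b)
    return value

def cohen_kappa(labels_1, labels_2):
    """Unweighted Cohen's kappa for two raters labelling the same subjects."""
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    freq_1, freq_2 = Counter(labels_1), Counter(labels_2)
    # Chance agreement from the two raters' marginal label frequencies.
    expected = sum(freq_1[k] * freq_2[k] for k in freq_1) / n ** 2
    # Undefined when expected == 1 (both raters always give the same single label).
    return (observed - expected) / (1 - expected)

# Made-up ratings: one tuple per subject, items A-D in order.
rater_1 = [(1, 0, 1, 1), (0, 0, 0, 0), (0, 1, 1, 0), (1, 1, 1, 1), (0, 0, 0, 1)]
rater_2 = [(1, 0, 1, 1), (0, 0, 0, 1), (0, 1, 1, 0), (1, 1, 1, 1), (0, 0, 0, 1)]

patterns_1 = [encode_pattern(r) for r in rater_1]
patterns_2 = [encode_pattern(r) for r in rater_2]
pattern_kappa = cohen_kappa(patterns_1, patterns_2)

# Share of individual item ratings on which the two raters agree.
item_agreement = sum(
    a == b for r1, r2 in zip(rater_1, rater_2) for a, b in zip(r1, r2)
) / (len(rater_1) * 4)
```

In this toy data the raters disagree on a single item for a single subject, so item-level agreement is 19 out of 20, yet that subject counts as a complete mismatch at the pattern level and the pattern kappa comes out near 0.75.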
To get fair estimates of all the ICCs, you would need to run a cross-classified mixed model. Let us first `reshape` the data to make it possible:
Now, we can run `xtmelogit` (or `gllamm` if you like it better) on this data:
This is a cross-classified model with three random effects: subjects, raters and items, assuming that they are uncorrelated (which is wrong for this data; see below). Let us now estimate the ICCs:
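As a rough illustration of the arithmetic behind such ICCs, here is a minimal Python sketch (not Stata) of one common convention for a logistic mixed model: each ICC is a variance component divided by the sum of all variance components plus the latent logistic residual variance π²/3. It assumes the random-effect estimates are stored as log standard deviations, which is my understanding of how `xtmelogit` parameterizes `e(b)`. The function name `latent_iccs` and the numbers are hypothetical, not estimates from the model above.

```python
import math

# Variance of the standard logistic distribution, used as the level-1
# residual variance on the latent scale.
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3

def latent_iccs(log_sds):
    """Map {component: log standard deviation} to {component: latent-scale ICC}."""
    # Assumption: inputs are log standard deviations, so variance = exp(2 * log_sd).
    variances = {name: math.exp(2 * value) for name, value in log_sds.items()}
    total = sum(variances.values()) + LOGISTIC_RESIDUAL_VARIANCE
    return {name: var / total for name, var in variances.items()}

# Made-up log standard deviations for the three crossed random effects.
example_log_sds = {"rater": -10.0, "subject": 0.0, "item": -0.5}
iccs = latent_iccs(example_log_sds)
```

A very negative log standard deviation, as for the hypothetical rater component here, corresponds to a variance that is essentially zero and hence an ICC that is essentially zero. In Stata itself, the same ratios could be formed with `nlcom` on the relevant parameters.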
(Hint: I figured out the names of the parameters by `matrix list e(b)`.)
These are ICCs corresponding to raters, subjects and items, respectively. The zero ICC of the raters actually makes sense in the context of how the data were generated: there is no systematic effect in the sense that one rater consistently rates the condition better or worse than the other rater. There is an interaction between rater and item, but the model does not reflect it. A specification truer to life would be something like:
With this specification, you would have to get ICCs by an even more complicated mix of the variance components and the point estimates from the fixed effects part of the model.
If you have the patience (or a powerful computer), you can specify `intp(7)` or something like that to get an approximation more accurate than the Laplace approximation (a single point at the mode of the distribution of the random effects).