I'm a little confused regarding the intraclass correlation coefficient and one-way ANOVA. As I understand it, both tell you how similar observations within a group are, relative to observations in other groups.
Could someone explain this a little better, and perhaps explain the situation(s) in which each method is more advantageous?
Best Answer
Both methods rely on the same idea: decomposing the observed variance into different parts or components. However, there are subtle differences in whether we consider items and/or raters as fixed or random effects. Apart from telling us what part of the total variability is explained by the between-group factor (or how much the between variance departs from the residual variance), the F-test doesn't say much. At least this holds for a one-way ANOVA where we assume a fixed effect (and which corresponds to the ICC(1,1) described below). The ICC, on the other hand, provides a bounded index when assessing rating reliability for several "exchangeable" raters, or the homogeneity among analytical units.
We usually make the following distinction between the different kinds of ICCs, following the seminal work of Shrout and Fleiss (1979):

- One-way random effects model, ICC(1,1): each subject is rated by a different set of raters, randomly drawn from a larger population of raters.
- Two-way random effects model, ICC(2,1): a random sample of k raters rates each subject, and we wish to generalize to the population of raters.
- Two-way mixed model, ICC(3,1): the k raters are the only raters of interest, so they are treated as fixed effects.

This corresponds to cases 1 to 3 in their Table 1. An additional distinction can be made depending on whether the observed ratings are the average of several ratings (in which case they are called ICC(1,k), ICC(2,k), and ICC(3,k)) or not.
In sum, you have to choose the right model (one-way vs. two-way), and this is largely discussed in Shrout and Fleiss's paper. A one-way model tends to yield smaller values than the two-way model; likewise, a random-effects model generally yields lower values than a fixed-effects model. An ICC derived from a fixed-effects model is considered a way to assess rater consistency (because we ignore rater variance), while for a random-effects model we speak of an estimate of rater agreement (whether raters are interchangeable or not). Only the two-way models incorporate the rater × subject interaction, which may be of interest when trying to unravel atypical rating patterns.
The following illustration is essentially a copy/paste of the example from `ICC()` in the psych package (the data come from Shrout and Fleiss, 1979). The data consist of 4 judges (J) assessing 6 subjects or targets (S), and are summarized below (I will assume they are stored as an R matrix named `sf`).
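For reference, here is a minimal reconstruction of the data matrix, as given in Table 2 of Shrout and Fleiss (1979) and in the `ICC()` help page:

```r
# Shrout & Fleiss (1979) example: 6 subjects (rows) rated by 4 judges (columns)
sf <- matrix(c(9,  2, 5, 8,
               6,  1, 3, 2,
               8,  4, 6, 8,
               7,  1, 2, 6,
               10, 5, 6, 9,
               6,  2, 4, 7),
             ncol = 4, byrow = TRUE,
             dimnames = list(paste0("S", 1:6), paste0("J", 1:4)))
```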
This example is interesting because it shows how the choice of the model can influence the results, and therefore the interpretation of the reliability study. All 6 ICC models are as follows (this is Table 4 in Shrout and Fleiss's paper):
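A sketch of the call with the psych package; the estimates shown as comments are the values reported in Shrout and Fleiss's Table 4:

```r
library(psych)  # provides ICC()
ICC(sf)
## Single ratings:         Average of k = 4 ratings:
## ICC(1,1) = 0.17         ICC(1,k) = 0.44   (one-way random)
## ICC(2,1) = 0.29         ICC(2,k) = 0.62   (two-way random)
## ICC(3,1) = 0.71         ICC(3,k) = 0.91   (two-way mixed, raters fixed)
```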
As can be seen, considering raters as fixed effects (hence not trying to generalize to a wider pool of raters) would yield a much higher value for the homogeneity of the measurement. (Similar results could be obtained with the irr package (`icc()`), although we have to play with the different options for model type and unit of analysis.)

What does the ANOVA approach tell us? We need to fit two models to get the relevant mean squares:
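A sketch of the two fits, after reshaping `sf` to long format (the names `sf.df`, `rating`, `subject`, and `rater` are mine):

```r
# long format: one row per (subject, rater) pair
sf.df <- data.frame(rating  = as.vector(sf),
                    subject = factor(rep(rownames(sf), times = ncol(sf))),
                    rater   = factor(rep(colnames(sf), each  = nrow(sf))))

# one-way ANOVA (subjects only): gives BMS and WMS
aov1 <- aov(rating ~ subject, data = sf.df)
summary(aov1)

# two-way ANOVA (subjects and raters): gives JMS and the residual EMS
aov2 <- aov(rating ~ subject + rater, data = sf.df)
summary(aov2)
```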
No need to look at the F-tests here; only the mean squares (MS) are of interest.
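For instance, the mean squares can be pulled directly from the two fits above:

```r
# "Mean Sq" column of the ANOVA tables returned by summary()
ms1 <- summary(aov1)[[1]][["Mean Sq"]]   # c(BMS, WMS)
ms2 <- summary(aov2)[[1]][["Mean Sq"]]   # c(BMS, JMS, EMS)
BMS <- ms1[1]; WMS <- ms1[2]
JMS <- ms2[2]; EMS <- ms2[3]
```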
Now we can assemble the different pieces into an extended ANOVA table, which looks like the one shown below (this is Table 3 in Shrout and Fleiss's paper):
| Source of variation    | df | MS    |
|------------------------|----|-------|
| Between subjects (BMS) | 5  | 11.24 |
| Within subjects (WMS)  | 18 | 6.26  |
| Between raters (JMS)   | 3  | 32.49 |
| Residual (EMS)         | 15 | 1.02  |
where the first two rows come from the one-way model, whereas the last two come from the two-way ANOVA.
It is easy to check all the formulae in Shrout and Fleiss's article, and we now have everything we need to estimate the reliability of a single assessment. What about the reliability of the average of multiple assessments (which is often the quantity of interest in inter-rater studies)? Following Hays and Revicki (2005), it can be obtained from the above decomposition by simply changing the total MS considered in the denominator, except for the two-way random-effects model, for which we have to rewrite the ratio of MSs.
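As a sketch, all six estimates can be computed by hand from the mean squares in the table above, using Shrout and Fleiss's formulas (with n = 6 subjects and k = 4 raters):

```r
n <- 6; k <- 4                       # subjects, raters
BMS <- 11.24; WMS <- 6.26            # one-way model
JMS <- 32.49; EMS <- 1.02            # two-way model

# single-rating versions (cases 1 to 3)
ICC11 <- (BMS - WMS) / (BMS + (k - 1) * WMS)
ICC21 <- (BMS - EMS) / (BMS + (k - 1) * EMS + k * (JMS - EMS) / n)
ICC31 <- (BMS - EMS) / (BMS + (k - 1) * EMS)

# average-rating versions: same numerators, simplified denominators,
# except for the two-way random model, where the rater variance term remains
ICC1k <- (BMS - WMS) / BMS
ICC2k <- (BMS - EMS) / (BMS + (JMS - EMS) / n)
ICC3k <- (BMS - EMS) / BMS

round(c(ICC11, ICC21, ICC31, ICC1k, ICC2k, ICC3k), 2)
## 0.17 0.29 0.71 0.44 0.62 0.91
```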
Again, we find that the overall reliability is higher when considering raters as fixed effects.
References

Shrout, P.E. and Fleiss, J.L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.

Hays, R.D. and Revicki, D. (2005). Reliability and validity (including responsiveness). In Fayers, P. and Hays, R.D. (eds.), Assessing Quality of Life in Clinical Trials, 2nd edition, pp. 25-39. Oxford University Press.