I'm a little confused regarding the intraclass correlation coefficient and one-way ANOVA. As I understand it, both tell you how similar observations within a group are, relative to observations in other groups.
Could someone explain this a little better, and perhaps explain the situation(s) in which each method is more advantageous?
Best Answer
Both methods rely on the same idea: decomposing the observed variance into different parts or components. However, there are subtle differences in whether we consider items and/or raters as fixed or random effects. Apart from telling us what part of the total variability is explained by the between-group factor (or how much the between variance departs from the residual variance), the F-test doesn't say much. At least this holds for a one-way ANOVA where we assume a fixed effect (and which corresponds to the ICC(1,1) described below). The ICC, on the other hand, provides a bounded index when assessing rating reliability for several "exchangeable" raters, or the homogeneity among analytical units.
We usually make the following distinction between the different kinds of ICCs, following the seminal work of Shrout and Fleiss (1979):

- One-way random effects model, ICC(1,1): each subject is rated by a different set of raters, randomly drawn from a larger population of raters.
- Two-way random effects model, ICC(2,1): a random sample of k raters rates each subject, and we wish to generalize to the population of raters.
- Two-way mixed model, ICC(3,1): the k raters are the only raters of interest, so they are treated as fixed effects.

This corresponds to cases 1 to 3 in their Table 1. An additional distinction can be made depending on whether the observed ratings are the average of several ratings (in which case they are called ICC(1,k), ICC(2,k), and ICC(3,k)) or not.
In sum, you have to choose the right model (one-way vs. two-way), and this is largely discussed in Shrout and Fleiss's paper. A one-way model tends to yield smaller values than the two-way model; likewise, a random-effects model generally yields lower values than a fixed-effects model. An ICC derived from a fixed-effects model is considered a way to assess rater consistency (because we ignore rater variance), while for a random-effects model we speak of an estimate of rater agreement (whether raters are interchangeable or not). Only the two-way models incorporate the rater × subject interaction, which may be of interest when trying to unravel atypical rating patterns.
The following illustration is essentially a copy/paste of the example from `ICC()` in the psych package (the data come from Shrout and Fleiss, 1979). The data consist of 4 judges (J) assessing 6 subjects or targets (S), and are summarized below (I will assume they are stored as an R matrix named `sf`).
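For reference, here is a minimal reconstruction of the data matrix, as given in Table 2 of Shrout and Fleiss (1979) and in the `ICC()` help page:

```r
# Shrout & Fleiss (1979) example: 6 subjects (rows) rated by 4 judges (columns)
sf <- matrix(c(9,  2, 5, 8,
               6,  1, 3, 2,
               8,  4, 6, 8,
               7,  1, 2, 6,
               10, 5, 6, 9,
               6,  2, 4, 7),
             ncol = 4, byrow = TRUE,
             dimnames = list(paste0("S", 1:6), paste0("J", 1:4)))
```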
This example is interesting because it shows how the choice of the model can influence the results, and therefore the interpretation of the reliability study. All 6 ICC models are as follows (this is Table 4 in Shrout and Fleiss's paper):
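A sketch of the call with the psych package; the estimates shown as comments are the values reported in Shrout and Fleiss's Table 4:

```r
library(psych)  # provides ICC()
ICC(sf)
## Single ratings:         Average of k = 4 ratings:
## ICC(1,1) = 0.17         ICC(1,k) = 0.44   (one-way random)
## ICC(2,1) = 0.29         ICC(2,k) = 0.62   (two-way random)
## ICC(3,1) = 0.71         ICC(3,k) = 0.91   (two-way mixed, raters fixed)
```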
As can be seen, considering raters as fixed effects (hence not trying to generalize to a wider pool of raters) would yield a much higher value for the homogeneity of the measurement. (Similar results could be obtained with the irr package (`icc()`), although we have to play with the different options for model type and unit of analysis.)

What does the ANOVA approach tell us? We need to fit two models to get the relevant mean squares:
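A sketch of the two fits, after reshaping `sf` to long format (the names `sf.df`, `rating`, `subject`, and `rater` are mine):

```r
# long format: one row per (subject, rater) pair
sf.df <- data.frame(rating  = as.vector(sf),
                    subject = factor(rep(rownames(sf), times = ncol(sf))),
                    rater   = factor(rep(colnames(sf), each  = nrow(sf))))

# one-way ANOVA (subjects only): gives BMS and WMS
aov1 <- aov(rating ~ subject, data = sf.df)
summary(aov1)

# two-way ANOVA (subjects and raters): gives JMS and the residual EMS
aov2 <- aov(rating ~ subject + rater, data = sf.df)
summary(aov2)
```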
No need to look at the F-tests here; only the mean squares (MS) are of interest.
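For instance, the mean squares can be pulled directly from the two fits above:

```r
# "Mean Sq" column of the ANOVA tables returned by summary()
ms1 <- summary(aov1)[[1]][["Mean Sq"]]   # c(BMS, WMS)
ms2 <- summary(aov2)[[1]][["Mean Sq"]]   # c(BMS, JMS, EMS)
BMS <- ms1[1]; WMS <- ms1[2]
JMS <- ms2[2]; EMS <- ms2[3]
```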
Now we can assemble the different pieces into an extended ANOVA table, which looks like the one shown below (this is Table 3 in Shrout and Fleiss's paper):
| Source of variation    | df | MS    |
|------------------------|----|-------|
| Between subjects (BMS) | 5  | 11.24 |
| Within subjects (WMS)  | 18 | 6.26  |
| Between raters (JMS)   | 3  | 32.49 |
| Residual (EMS)         | 15 | 1.02  |
where the first two rows come from the one-way model, whereas the last two come from the two-way ANOVA.
It is easy to check all the formulae in Shrout and Fleiss's article, and we now have everything we need to estimate the reliability of a single assessment. What about the reliability of the average of multiple assessments (which is often the quantity of interest in inter-rater studies)? Following Hays and Revicki (2005), it can be obtained from the above decomposition by simply changing the total MS considered in the denominator, except for the two-way random-effects model, for which we have to rewrite the ratio of MSs.
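As a sketch, all six estimates can be computed by hand from the mean squares in the table above, using Shrout and Fleiss's formulas (with n = 6 subjects and k = 4 raters):

```r
n <- 6; k <- 4                       # subjects, raters
BMS <- 11.24; WMS <- 6.26            # one-way model
JMS <- 32.49; EMS <- 1.02            # two-way model

# single-rating versions (cases 1 to 3)
ICC11 <- (BMS - WMS) / (BMS + (k - 1) * WMS)
ICC21 <- (BMS - EMS) / (BMS + (k - 1) * EMS + k * (JMS - EMS) / n)
ICC31 <- (BMS - EMS) / (BMS + (k - 1) * EMS)

# average-rating versions: same numerators, simplified denominators,
# except for the two-way random model, where the rater variance term remains
ICC1k <- (BMS - WMS) / BMS
ICC2k <- (BMS - EMS) / (BMS + (JMS - EMS) / n)
ICC3k <- (BMS - EMS) / BMS

round(c(ICC11, ICC21, ICC31, ICC1k, ICC2k, ICC3k), 2)
## 0.17 0.29 0.71 0.44 0.62 0.91
```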
Again, we find that the overall reliability is higher when considering raters as fixed effects.
References

Shrout, P.E. and Fleiss, J.L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.

Hays, R.D. and Revicki, D. (2005). Reliability and validity (including responsiveness). In Fayers, P. and Hays, R.D. (eds.), Assessing Quality of Life in Clinical Trials, 2nd edition, pp. 25-39. Oxford University Press.