I am having difficulty finding information on the assumptions for the intraclass correlation. Can someone please tell me what they are?
Solved – Assumptions for intraclass correlation
Related Solutions
I think (1) is not a statistical question but a subject-area one. E.g., in the described example it would be up to those who study group psychology to determine appropriate language for the strength of ICCs. This is analogous to a Pearson correlation -- what constitutes 'strong' differs depending on whether one is working in, for example, sociology or physics.
(2) is to an extent also subject-area specific -- it depends on what researchers are aiming to measure and describe. But from a statistical point of view ICC is a reasonable metric for within-team relatedness. However I agree with Mike that when you say you'd like to
"describe the extent to which the measure of team effectiveness is a property of the team member's idiosyncratic belief or a property of a shared belief about the team"
then it is probably more appropriate to use variance components in their raw form than to convert them into an ICC.
To clarify, think of the ICC as calculated within a mixed model. For a single-level mixed model with random group-level intercepts $b_i \sim N(0, \sigma^2_b)$ and within-group errors $\epsilon_{ij} \stackrel{\mathrm{iid}}{\sim} N(0, \sigma^2)$, $\sigma^2_b$ describes the amount of variation between teams and $\sigma^2$ describes variation within teams. Then, for a single team, we get a response covariance matrix of $\sigma^2 \mathbf{I} + \sigma^2_b \mathbf{1}\mathbf{1}'$ which when converted to a correlation matrix is $\frac{\sigma^2}{\sigma^2 + \sigma^2_b} \mathbf{I} + \frac{\sigma^2_b}{\sigma^2 + \sigma^2_b} \mathbf{1}\mathbf{1}'$. So, $\frac{\sigma^2_b}{\sigma^2 + \sigma^2_b} = \mathrm{ICC}$ describes the level of correlation between effectiveness responses within a team, but it sounds as though you may be more interested in $\sigma^2$ and $\sigma^2_b$, or perhaps $\frac{\sigma^2}{\sigma^2_b}$.
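As a small numerical check of the matrix identity above (a sketch with made-up variance components, not values from the question), constructing the response covariance matrix for one team and converting it to a correlation matrix recovers the ICC on the off-diagonal:

```python
import numpy as np

sigma2 = 2.0    # within-team (residual) variance, hypothetical value
sigma2_b = 3.0  # between-team variance, hypothetical value
n = 4           # team size, hypothetical

# Covariance matrix for one team: sigma^2 * I + sigma^2_b * 11'
Sigma = sigma2 * np.eye(n) + sigma2_b * np.ones((n, n))

# Convert to a correlation matrix
d = np.sqrt(np.diag(Sigma))
R = Sigma / np.outer(d, d)

icc = sigma2_b / (sigma2 + sigma2_b)  # here 3/5 = 0.6
print(R[0, 1], icc)  # off-diagonal correlation equals the ICC
```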
Both methods rely on the same idea, that of decomposing the observed variance into different parts or components. However, there are subtle differences in whether we consider items and/or raters as fixed or random effects. Apart from saying what part of the total variability is explained by the between factor (or how much the between variance departs from the residual variance), the F-test doesn't say much. At least this holds for a one-way ANOVA where we assume a fixed effect (and which corresponds to the ICC(1,1) described below). On the other hand, the ICC provides a bounded index when assessing rating reliability for several "exchangeable" raters, or homogeneity among analytical units.
We usually make the following distinction between the different kinds of ICCs. This follows from the seminal work of Shrout and Fleiss (1979):
- One-way random effects model, ICC(1,1): each item is rated by a different set of raters, considered as sampled from a larger pool of potential raters, hence treated as random effects; the ICC is then interpreted as the % of total variance accounted for by the subjects/items variance. This is an absolute-agreement ICC.
- Two-way random effects model, ICC(2,1): both factors -- raters and items/subjects -- are viewed as random effects, and we have two variance components (or mean squares) in addition to the residual variance; we further assume that all raters assess all items/subjects; the ICC in this case gives the % of the total variance (subjects + raters + error) that is attributable to the subjects/items.
- Two-way mixed model, ICC(3,1): contrary to the one-way approach, here raters are considered as fixed effects (no generalization beyond the sample at hand) but items/subjects are treated as random effects; the unit of analysis may be the individual or the average ratings.
This corresponds to cases 1 to 3 in their Table 1. An additional distinction can be made depending on whether we consider that observed ratings are the average of several ratings (they are called ICC(1,k), ICC(2,k), and ICC(3,k)) or not.
In sum, you have to choose the right model (one-way vs. two-way), and this is largely discussed in Shrout and Fleiss's paper. A one-way model tends to yield smaller values than the two-way model; likewise, a random-effects model generally yields lower values than a fixed-effects model. An ICC derived from a fixed-effects model is considered a way to assess rater consistency (because we ignore rater variance), while for a random-effects model we talk of an estimate of rater agreement (whether raters are interchangeable or not). Only the two-way models incorporate the rater x subject interaction, which might be of interest when trying to unravel atypical rating patterns.
The following illustration is essentially a copy/paste of the example for ICC() in the psych package (the data come from Shrout and Fleiss, 1979). The data consist of 4 judges (J) assessing 6 subjects or targets (S), and are summarized below (I will assume they are stored as an R matrix named sf):
J1 J2 J3 J4
S1 9 2 5 8
S2 6 1 3 2
S3 8 4 6 8
S4 7 1 2 6
S5 10 5 6 9
S6 6 2 4 7
This example is interesting because it shows how the choice of the model might influence the results, and therefore the interpretation of the reliability study. All six ICC models are as follows (this is Table 4 in Shrout and Fleiss's paper):
Intraclass correlation coefficients
                         type   ICC    F df1 df2       p lower bound upper bound
Single_raters_absolute   ICC1  0.17  1.8   5  18 0.16477      -0.133        0.72
Single_random_raters     ICC2  0.29 11.0   5  15 0.00013       0.019        0.76
Single_fixed_raters      ICC3  0.71 11.0   5  15 0.00013       0.342        0.95
Average_raters_absolute  ICC1k 0.44  1.8   5  18 0.16477      -0.884        0.91
Average_random_raters    ICC2k 0.62 11.0   5  15 0.00013       0.071        0.93
Average_fixed_raters     ICC3k 0.91 11.0   5  15 0.00013       0.676        0.99
As can be seen, considering raters as fixed effects (hence not trying to generalize to a wider pool of raters) yields a much higher value for the homogeneity of the measurement. (Similar results could be obtained with the irr package's icc(), although we have to play with the different options for model type and unit of analysis.)
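As a cross-check that needs neither psych nor irr, the three single-rater coefficients in the table above can be recovered from the raw mean squares with plain NumPy (a sketch; the names BMS, WMS, JMS, and EMS follow Shrout and Fleiss's notation):

```python
import numpy as np

# 6 subjects (rows) x 4 judges (columns), Shrout & Fleiss (1979) data
sf = np.array([[ 9, 2, 5, 8],
               [ 6, 1, 3, 2],
               [ 8, 4, 6, 8],
               [ 7, 1, 2, 6],
               [10, 5, 6, 9],
               [ 6, 2, 4, 7]], dtype=float)
n, k = sf.shape
grand = sf.mean()

# One-way ANOVA: between-subject and within-subject mean squares
BMS = k * np.sum((sf.mean(axis=1) - grand) ** 2) / (n - 1)
SSW = np.sum((sf - sf.mean(axis=1, keepdims=True)) ** 2)  # within-subject SS
WMS = SSW / (n * (k - 1))

# Two-way ANOVA: judge (rater) mean square and residual mean square
JMS = n * np.sum((sf.mean(axis=0) - grand) ** 2) / (k - 1)
EMS = (SSW - JMS * (k - 1)) / ((n - 1) * (k - 1))

icc1 = (BMS - WMS) / (BMS + (k - 1) * WMS)                        # one-way random
icc2 = (BMS - EMS) / (BMS + (k - 1) * EMS + k * (JMS - EMS) / n)  # two-way random
icc3 = (BMS - EMS) / (BMS + (k - 1) * EMS)                        # two-way mixed
print(round(icc1, 2), round(icc2, 2), round(icc3, 2))  # 0.17 0.29 0.71
```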
What does the ANOVA approach tell us? We need to fit two models to get the relevant mean squares:
- a one-way model that considers subjects only; this allows us to separate the targets being rated (between-group MS, BMS) and to get an estimate of the within-error term (WMS);
- a two-way model that considers subject + rater + their interaction (when there are no replications, this last term is confounded with the residuals); this allows us to estimate the rater main effect (JMS), which can be accounted for if we want to use a random-effects model (i.e., we add it to the total variability).
No need to look at the F-test, only MSs are of interest here.
library(reshape)  # provides melt()
sf.df <- melt(sf, varnames = c("Subject", "Rater"))  # reshape to long format
anova(lm(value ~ Subject, sf.df))          # one-way model: gives BMS and WMS
anova(lm(value ~ Subject * Rater, sf.df))  # two-way model: the Rater MS is JMS,
                                           # the Subject:Rater MS plays the role of EMS
Now, we can assemble the different pieces in an extended ANOVA table (this is Table 3 in Shrout and Fleiss's paper):

Source                  df      MS
Between targets (BMS)    5   11.24
Within targets  (WMS)   18    6.26
  Between raters (JMS)   3   32.49
  Residual (EMS)        15    1.02

where the first two rows come from the one-way model, whereas the last two come from the two-way ANOVA.
It is easy to check all formulae in Shrout and Fleiss's article, and we have everything we need to estimate the reliability for a single assessment. What about the reliability for the average of multiple assessments (which often is the quantity of interest in inter-rater studies)? Following Hays and Revicki (2005), it can be obtained from the above decomposition by just changing the total MS considered in the denominator, except for the two-way random-effects model for which we have to rewrite the ratio of MSs.
- In case of ICC(1,1)=(BMS-WMS)/(BMS+(k-1)•WMS), the overall reliability is computed as (BMS-WMS)/BMS=0.443.
- For the ICC(2,1)=(BMS-EMS)/(BMS+(k-1)•EMS+k•(JMS-EMS)/N), the overall reliability is (N•(BMS-EMS))/(N•BMS+JMS-EMS)=0.620.
- Finally, for the ICC(3,1)=(BMS-EMS)/(BMS+(k-1)•EMS), we have a reliability of (BMS-EMS)/BMS=0.909.
Again, we find that the overall reliability is higher when considering raters as fixed effects.
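Plugging the mean squares computed from the data above (BMS = 11.24, WMS = 6.26, JMS = 32.49, EMS = 1.02, with k = 4 raters and N = 6 subjects) into the formulas of the three bullet points reproduces both the single-rater and the average-rater values; a quick sketch:

```python
# Mean squares for the Shrout & Fleiss data (k = 4 raters, N = 6 subjects)
BMS, WMS, JMS, EMS = 11.24, 6.26, 32.49, 1.02
k, N = 4, 6

# Single-rater ICCs and the corresponding average-rater (overall) reliabilities
icc1  = (BMS - WMS) / (BMS + (k - 1) * WMS)
icc1k = (BMS - WMS) / BMS
icc2  = (BMS - EMS) / (BMS + (k - 1) * EMS + k * (JMS - EMS) / N)
icc2k = N * (BMS - EMS) / (N * BMS + JMS - EMS)
icc3  = (BMS - EMS) / (BMS + (k - 1) * EMS)
icc3k = (BMS - EMS) / BMS
print(f"{icc1k:.3f} {icc2k:.3f} {icc3k:.3f}")  # 0.443 0.620 0.909
```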
References
- Shrout, P.E. and Fleiss, J.L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
- Hays, R.D. and Revicki, D. (2005). Reliability and validity (including responsiveness). In Fayers, P. and Hays, R.D. (eds.), Assessing Quality of Life in Clinical Trials, 2nd ed., pp. 25-39. Oxford University Press.
Best Answer
There is no context to your request, so here is an attempt to answer it in the context of regression models. More specifically, I shall refer to the usual linear mixed model.
Let $y_{ij}$ be the observation for subject $j \in \{ 1, \ldots{}, n_i\}$ from group $i \in \{1, \ldots{}, g\}$. The model I shall consider takes the form
$y_{ij} = \underbrace{\mu_{ij}}_{\textrm{fixed part}} + \underbrace{u_{i}}_{\textrm{random part}} + e_{ij}$,
under the assumption that $u_{i}$ is a realisation of a $N(0, \sigma^2_u)$ random variable, $e_{ij}$ is a realisation of a $N(0, \sigma^2)$ random variable, and under independence between these two random variables.
Observe that all subjects from the $i$th group share the same value of the random effect. Therefore, the random effect accounts for the association between observations from the same group. Mathematically, this can be seen by computing $\textrm{Corr}(Y_{ij}, Y_{i'j'})$ (see, e.g., here). When $i = i'$ and $j \neq j'$, $\textrm{Corr}(Y_{ij}, Y_{i'j'}) = \textrm{Corr}(Y_{ij}, Y_{ij'}) = \frac{\sigma^2_u}{\sigma^2_u + \sigma^2} > 0$ is known as the intraclass correlation.
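A quick simulation illustrates this (a sketch with hypothetical values $\sigma^2_u = \sigma^2 = 1$, so the intraclass correlation is 0.5): the empirical correlation between two observations drawn from the same group matches $\sigma^2_u / (\sigma^2_u + \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
g = 200_000                    # number of groups, two observations per group
sigma2_u, sigma2 = 1.0, 1.0    # hypothetical variance components

u = rng.normal(0.0, np.sqrt(sigma2_u), size=g)     # shared group effect u_i
y1 = u + rng.normal(0.0, np.sqrt(sigma2), size=g)  # first subject in each group
y2 = u + rng.normal(0.0, np.sqrt(sigma2), size=g)  # second subject in each group

icc_true = sigma2_u / (sigma2_u + sigma2)          # 0.5
icc_hat = np.corrcoef(y1, y2)[0, 1]                # should be close to 0.5
print(icc_true, icc_hat)
```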