These are distinct ways of accounting for raters or items variance in overall variance, following Shrout and Fleiss (1979) (cases 1 to 3 in Table 1):
- One-way random effects model: raters are considered as sampled from a larger pool of potential raters, hence they are treated as random effects; the ICC is then interpreted as the % of total variance accounted for by subjects/items variance. This is called the consistency ICC.
- Two-way random effects model: both factors -- raters and items/subjects -- are viewed as random effects, and we have two variance components (or mean squares) in addition to the residual variance; we further assume that raters assess all items/subjects; the ICC gives in this case the % of variance attributable to raters + items/subjects.
- Two-way mixed model: contrary to the one-way approach, here raters are considered as fixed effects (no generalization beyond the sample at hand) but items/subjects are treated as random effects; the unit of analysis may be the individual or the average ratings.
I would say raters have to be entered as columns, although I'm not a specialist of SPSS.
Dave Garson's dedicated website is worth looking at for those working with SPSS. There is also a complete on-line tutorial on reliability analysis (Robert A. Yaffee) [archived version].
For theoretical consideration about the mixed-effect approach, please consider reading my answer to this related question: Reliability in Elicitation Exercise.
What you describes about Tukey's nonadditivity test sounds good to me. In effect, it allows to test for an item by rater interaction. Some words of caution, though:
- Tukey's nonadditivity test effectively allows to test for a linear-by-linear product of two factor main effects.
- The possibility of deriving a total score is irrelevant here, as this particular Tukey's test can be applied in any randomized block design, as described on Stata FAQ, for example.
- It applies in situation where you have a single observation per cell, that is each rater assess only one item (no replicates).
You might recall that the interaction term is confounded with the error term when there're no replicates in an ANOVA design; in inter-rater studies, it means we have only one rating for each rater x item cell. Tukey's test in this case provide a 1-DF test for assessing any deviation from additivity, which is a common assumption to interpret a main effect in two-factor models. Here is a tutorial describing how it works.
I must admit I never used it when computing ICC, and I spent some times trying to reproduce Dave Garson's results with R. This led me to the following two papers that showed that Tukey's nonadditivity test might not be the "best" test to use as it will fail to recover a true interaction effect (e.g., where some raters exhibit an opposite rating behavior compared to the rest of the raters) when there's no main effect of the target of the ratings (e.g., marks given to items):
- Lahey, M.A., Downey, R.G., and Saal, F.E. (1983). Intraclass Correlations: There's More There Than Meets the Eye. Psychological Bulletin, 93(3), 586-595.
- Johnson, D.E. and Graybill, F.A. (1972). An analysis of a two-way model with interaction and no replication. Journal of the American Statistical Association, 67, 862-868.
- Hegemann, V. and Johnson, D.E. (1976). The power of two tests for nonadditivity. Journal of the American Statistical Association, 71(356), 945-948.
(I'm very sorry but I couldn't find ungated PDF version of those papers. The first one is really a must-read one.)
About your particular design, you considered raters as fixed effects (hence the use of Shrout and Fleiss's type 3 ICC, i.e. mixed model approach). In this case,
Lahey et al. (1) stated that you face a situation of nonorthogonal interaction components (i.e., the interaction is not independent of other effect) and a biased estimate of the rating effect -- but, this for the case where you have a single observation per cell (ICC(3,1)). With multiple ratings per items, estimating ICC(3,k) requires the "assumption of nonsignificance of the interaction. In this case, the ANOVA effects are neither theoretically nor mathematically independent, and without adequate justification, the assumption of no interaction is very tenuous."
In other words, such an interaction test aims at offering you diagnostic information.
My opinion is that you can go on with you ICC, but be sure to check that (a) there's a significant effect for the target of ratings (otherwise, it would mean the reliability of measurements is low), (b) no rater systematically deviates from others' ratings (this can be done graphically, or based on the residuals of your ANOVA model).
More technical details are given below.
The alternative test that is proposed is called the characteristic root test of the interaction (2,3). Consider a multiplicative interaction model of the form (here, as an effect model, that is we use parameters that summarize deviations from the grand mean):
$$\mu_{ij}=\mu + \tau_i + \beta_j + \lambda\alpha_i\gamma_j + \varepsilon_{ij}$$
with $\tau$ ($i=1,\dots,t$) the effect due to targets/items, $\beta$ ($j=1,\dots,b$) the effect of raters, $\alpha\gamma$ the interaction targets x raters, and the usual assumptions for the distribution of errors and parameters constraints. We can compute the largest characteristic root of $Z'Z$ or $ZZ'$, where $Z=z_{ij}=y_{ij}-y_{i\cdot}-y_{\cdot j}+y_{\cdot\cdot}$ is the $t \times b$ matrix of residuals from an additive model.
The test then relies on the idea of using $\lambda_1/\text{RSS}$ as a test statistic ($H_0:\, \lambda=0$) where $\lambda_1$ is the largest nonzero characteristic root of $ZZ'$ (or $Z'Z$), and RSS equals the residual sum of squares from an additive model (2).
Best Answer
The model you referenced in your question is called the "one-way model." It assumes that random row effects are the only systematic source of variance. In the case of inter-rater reliability, rows correspond to objects of measurement (e.g., subjects).
However, there are also "two-way models." These assume that there is variance associated with random row effects as well as random or fixed column effects. In the case of inter-rater reliability, columns correspond to sources of measurement (e.g., raters).
Given a two-way model, you can calculate one of four ICC coefficients: the single score consistency ICC(C,1), the average score consistency ICC(C,k), the single score agreement ICC(A,1), or the average score agreement ICC(A,k). Single score ICCs apply to single measurements $x_{ij}$ (e.g., individual raters), whereas average score ICCs apply to average measurements $\bar{x}_i$ (e.g., the mean of all raters). Consistency ICCs exclude the column variance from the denominator variance (e.g., allowing raters to vary around their own means), whereas agreement ICCs include the column variance in the denominator variance (e.g., requiring raters to vary around the same mean).
Here are the definitions if you assume a random column effect:
You can also estimate these values using mean squares from ANOVA:
You can calculate these coefficients in R using the irr package:
References
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.