Assumption of additivity for intra-class correlation

Tags: agreement-statistics, intraclass-correlation, reliability

My question concerns the assumption of additivity for intraclass correlation. I shall first explain what I have done and then end with my questions.

I want to calculate inter-rater reliability using intra-class correlation so I can report an overall coefficient (as done in previous similar research), and perhaps replace a rater if their judgements correspond poorly to the other raters. I have five raters and they have each rated video recordings of facial and vocal expressions of the same (randomly sampled) 4 participants in an experiment where the participants watched different emotional films.

Raters make 18 ratings per film. These ratings are Likert-type (generally ranging from 1-6, but 1-4 for some measures) and cover the intensity of 6 different facial emotional expressions (anger, fear, etc.), the intensity of facial expression overall, the number (frequency ratings) and intensity (Likert ratings) of positive and negative words and sounds, and the level of overall vocal expressiveness.

There are 16 films, so there is a total of 288 variables per rater, per participant rated. I have organised my data into four files, one per participant being rated, with each rater as a column and the 288 variables as rows. As I am calculating inter-rater reliability, I am interested in the similarity of the raters overall, and not any other (e.g. film) effects.

I have calculated the ICC using the mixed-model approach, because all judges rate all targets, which are a random sample (as per http://faculty.chass.ncsu.edu/garson/PA765/reliab.htm#rater).
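For reference, the mixed-model (Shrout and Fleiss type 3) coefficients can be computed directly from the two-way ANOVA mean squares. Here is a minimal sketch in Python (a hypothetical `icc3` helper, not your actual analysis), assuming the ratings for one rated participant are arranged as an n-targets × k-raters NumPy array:

```python
import numpy as np

def icc3(Y):
    """Shrout & Fleiss ICC(3,1) and ICC(3,k) for an n-targets x k-raters matrix."""
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)
    col_means = Y.mean(axis=0)
    # Mean squares from the two-way ANOVA decomposition
    MSR = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between targets
    SSE = np.sum((Y - row_means[:, None] - col_means[None, :] + grand) ** 2)
    MSE = SSE / ((n - 1) * (k - 1))                        # residual
    icc31 = (MSR - MSE) / (MSR + (k - 1) * MSE)            # single rater
    icc3k = (MSR - MSE) / MSR                              # mean of k raters
    return icc31, icc3k
```

With the classic Shrout and Fleiss (1979) example data (6 targets, 4 judges), this returns ICC(3,1) ≈ 0.71 and ICC(3,k) ≈ 0.91.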

Questions:
The assumption of additivity states that each item should be linearly related to the total score. However, I don't think the concept of a total score really applies here, although I may be wrong. Tukey's test of non-additivity tests the null hypothesis that there is no multiplicative interaction between cases and items.

  • Could somebody please explain this to me in simple terms?

I found a significant Tukey's test value, so I tried removing the overall facial and vocal ratings for each film, as I thought these perhaps violated the requirement that each item contributes to the total score. However, Tukey's test remained significant. So, just as a little experiment, I removed 282 variables, leaving me with ratings of the 6 possible facial emotions for a single film. Tukey's test was still significant!

  • Is Tukey's test of non-additivity relevant to my problem?

  • If yes, what should I do about it being significant?

Best Answer

What you describe about Tukey's nonadditivity test sounds right to me. In effect, it lets you test for an item-by-rater interaction. Some words of caution, though:

  • Tukey's nonadditivity test effectively tests for a linear-by-linear interaction, i.e. a product of the two factors' main effects.
  • The possibility of deriving a total score is irrelevant here: this particular Tukey test can be applied in any randomized block design, as described in the Stata FAQ, for example.
  • It applies in situations where you have a single observation per cell, that is, where each rater rates each item only once (no replicates).

You might recall that the interaction term is confounded with the error term when there are no replicates in an ANOVA design; in inter-rater studies, this means we have only one rating per rater × item cell. In this case, Tukey's test provides a 1-df test for assessing any deviation from additivity, which is a common assumption for interpreting the main effects in two-factor models. Here is a tutorial describing how it works.
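The 1-df test is simple enough to compute by hand. A sketch in Python (a hypothetical `tukey_nonadditivity` helper, under the single-observation-per-cell assumption; SciPy supplies the F distribution):

```python
import numpy as np
from scipy import stats

def tukey_nonadditivity(Y):
    """Tukey's 1-df test of nonadditivity for a t x b matrix (one obs per cell)."""
    t, b = Y.shape
    grand = Y.mean()
    r = Y.mean(axis=1) - grand   # row (item) main effects
    c = Y.mean(axis=0) - grand   # column (rater) main effects
    # SS for the single multiplicative degree of freedom:
    # (sum_ij r_i * c_j * y_ij)^2 / (sum_i r_i^2 * sum_j c_j^2)
    ss_nonadd = (r @ Y @ c) ** 2 / (np.sum(r**2) * np.sum(c**2))
    resid = Y - Y.mean(axis=1, keepdims=True) - Y.mean(axis=0, keepdims=True) + grand
    ss_resid = np.sum(resid**2)
    df_rem = (t - 1) * (b - 1) - 1
    F = ss_nonadd / ((ss_resid - ss_nonadd) / df_rem)
    p = stats.f.sf(F, 1, df_rem)
    return F, p
```

A significant F indicates that the data deviate from the additive (no-interaction) model in the specific linear-by-linear direction the test is sensitive to.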

I must admit I have never used it when computing an ICC, and I spent some time trying to reproduce Dave Garson's results in R. This led me to the following papers, which show that Tukey's nonadditivity test may not be the "best" test to use, because it can fail to detect a true interaction effect (e.g., where some raters exhibit a rating behavior opposite to that of the other raters) when there is no main effect of the target of the ratings (e.g., marks given to items):

  1. Lahey, M.A., Downey, R.G., and Saal, F.E. (1983). Intraclass Correlations: There's More There Than Meets the Eye. Psychological Bulletin, 93(3), 586-595.
  2. Johnson, D.E. and Graybill, F.A. (1972). An analysis of a two-way model with interaction and no replication. Journal of the American Statistical Association, 67, 862-868.
  3. Hegemann, V. and Johnson, D.E. (1976). The power of two tests for nonadditivity. Journal of the American Statistical Association, 71(356), 945-948.

(I'm very sorry, but I couldn't find ungated PDF versions of these papers. The first one is really a must-read.)

Regarding your particular design: you considered raters as fixed effects (hence the use of Shrout and Fleiss's type 3 ICC, i.e., the mixed-model approach). In this case, Lahey et al. (1) state that you face a situation of nonorthogonal interaction components (i.e., the interaction is not independent of the other effects) and a biased estimate of the rating effect -- but this is for the case where you have a single observation per cell (ICC(3,1)). With multiple ratings per item, estimating ICC(3,k) requires the "assumption of nonsignificance of the interaction. In this case, the ANOVA effects are neither theoretically nor mathematically independent, and without adequate justification, the assumption of no interaction is very tenuous."

In other words, such an interaction test offers you diagnostic information. My opinion is that you can go on with your ICC, but be sure to check that (a) there is a significant effect for the target of the ratings (otherwise the reliability of the measurements is low), and (b) no rater systematically deviates from the others' ratings (this can be checked graphically, or based on the residuals of your ANOVA model).
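The residual-based check in (b) can be sketched as follows (a hypothetical `rater_residual_summary` helper; it assumes the ratings form a targets × raters array, and that a deviant rater will show inflated residuals after the additive fit):

```python
import numpy as np

def rater_residual_summary(Y):
    """Mean |residual| per rater after removing additive target and rater effects."""
    grand = Y.mean()
    resid = Y - Y.mean(axis=1, keepdims=True) - Y.mean(axis=0, keepdims=True) + grand
    return np.abs(resid).mean(axis=0)   # one value per rater (column)
```

A rater whose mean absolute residual is much larger than the others' is a candidate for closer inspection (or replacement).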


More technical details are given below.

The alternative test that has been proposed is the characteristic root test of the interaction (2,3). Consider a multiplicative interaction model of the following form (written as an effects model, that is, with parameters expressing deviations from the grand mean):

$$y_{ij}=\mu + \tau_i + \beta_j + \lambda\alpha_i\gamma_j + \varepsilon_{ij}$$

with $\tau_i$ ($i=1,\dots,t$) the effect due to targets/items, $\beta_j$ ($j=1,\dots,b$) the effect of raters, $\lambda\alpha_i\gamma_j$ the targets × raters interaction, and the usual assumptions on the error distribution and parameter constraints. We can compute the largest characteristic root of $Z'Z$ or $ZZ'$, where $Z=(z_{ij})$, with $z_{ij}=y_{ij}-y_{i\cdot}-y_{\cdot j}+y_{\cdot\cdot}$, is the $t \times b$ matrix of residuals from an additive model.

The test then relies on using $\lambda_1/\text{RSS}$ as a test statistic ($H_0:\, \lambda=0$), where $\lambda_1$ is the largest nonzero characteristic root of $ZZ'$ (or $Z'Z$) and RSS is the residual sum of squares from the additive model (2).
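As a sketch of the statistic itself (a hypothetical helper, not code from the papers): since the eigenvalues of $Z'Z$ sum to the RSS, the ratio $\lambda_1/\text{RSS}$ approaches 1 when a single multiplicative component accounts for most of the non-additive variation:

```python
import numpy as np

def char_root_stat(Y):
    """Largest characteristic root of Z'Z over the residual SS (Johnson-Graybill)."""
    grand = Y.mean()
    # Residuals from the additive model: z_ij = y_ij - y_i. - y_.j + y_..
    Z = Y - Y.mean(axis=1, keepdims=True) - Y.mean(axis=0, keepdims=True) + grand
    lam1 = np.linalg.eigvalsh(Z.T @ Z).max()   # largest eigenvalue of Z'Z
    rss = np.sum(Z**2)                         # equals trace(Z'Z)
    return lam1 / rss
```

Critical values for this ratio are tabulated in (2,3); the sketch only computes the statistic.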
