Solved – Inter-rater reliability using Intra-class correlation with ratings for multiple objects on multiple properties

Tags: agreement-statistics, intraclass-correlation, reliability, spss

Initial Question:

I am trying to calculate inter-rater reliability. Previous researchers in this area have used intraclass correlation. SPSS offers one-way random, two-way random and two-way mixed models. The SPSS help says to choose among them based on whether the 'people effects are random' and the 'item effects are random'. Can anybody please explain to me what these terms mean?

Also, I wanted to know how to structure my data file. I think I need to put each rater's ratings of each item in a separate column (rather than each rater on a separate row); is this right?

UPDATED question:

Thanks to the help I have received, I have visited the websites, which were useful. I have decided that I need to use the mixed model, and I have transposed my files so that each rater is a column.

Additivity:

The websites also mentioned some assumptions, one of which is additivity, tested with Tukey's test of non-additivity. Apparently, this tests the hypothesis that there is no multiplicative interaction between the cases and the items. I think I may have too many variables in my calculation, because the examples others use have far fewer, e.g. each rater making only one rating per case; as described below, my raters have made many more ratings than that.

  • What does the additivity assumption mean?
  • What should be done when the test of non-additivity is statistically significant?
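On the second point: Tukey's test partitions the case-by-item interaction sum of squares into a one-degree-of-freedom multiplicative component and a remainder; a significant result usually suggests trying a transformation of the ratings (e.g. a log or power transform) before fitting the additive model. As a minimal sketch, assuming a complete cases-by-raters table with no missing values (the function name is our own, not an SPSS one):

```python
# Tukey's one-degree-of-freedom test of non-additivity for a complete
# two-way table y (rows = cases, columns = raters/items), pure Python.
# F = SS_nonadd / (SS_balance / df2), with df2 = (r-1)(c-1) - 1.

def tukey_nonadditivity(y):
    r, c = len(y), len(y[0])
    m = sum(sum(row) for row in y) / (r * c)                     # grand mean
    row_m = [sum(row) / c for row in y]                          # case means
    col_m = [sum(y[i][j] for i in range(r)) / r for j in range(c)]  # item means

    # 1-df multiplicative (non-additivity) sum of squares
    num = sum(y[i][j] * (row_m[i] - m) * (col_m[j] - m)
              for i in range(r) for j in range(c))
    den = sum((a - m) ** 2 for a in row_m) * sum((b - m) ** 2 for b in col_m)
    ss_nonadd = num ** 2 / den

    # interaction (residual) sum of squares, then the "balance" remainder
    ss_resid = sum((y[i][j] - row_m[i] - col_m[j] + m) ** 2
                   for i in range(r) for j in range(c))
    ss_balance = ss_resid - ss_nonadd
    df2 = (r - 1) * (c - 1) - 1
    f = float("inf") if ss_balance == 0 else ss_nonadd / (ss_balance / df2)
    return f, ss_nonadd, ss_balance, df2

# Purely multiplicative toy table: all interaction is the 1-df component.
f, ssn, ssb, df2 = tukey_nonadditivity([[1, 2, 3], [2, 4, 6], [3, 6, 9]])
print(ssn, ssb)  # 4.0 0.0
```

The F statistic would then be referred to an F(1, df2) distribution; SPSS reports the corresponding p-value directly in its reliability output.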

Inter-rater reliability with multiple objects and properties

I will explain in more detail what I have done. I have five raters who have rated facial and vocal expressions of participants watching different emotional films. Raters make 18 ratings per film, and there are 16 films, so there is a total of 288 variables per rater, per participant rated. Each rater has rated the same 4 participants, and I have placed these participants in separate files. I thought about making separate files for each of the 16 films, but that would be a lot of files.

  • How should I calculate ICC and organise my files in this specific situation?
  • Should I calculate a mean of individual ICCs or perhaps select a random sample?

Best Answer

These are distinct ways of apportioning rater and item variance within the overall variance, following Shrout and Fleiss (1979) (cases 1 to 3 in their Table 1):

  • One-way random effects model: raters are considered as sampled from a larger pool of potential raters, hence they are treated as random effects, and each subject/item may be rated by a different set of raters; rater variance cannot be separated from residual variance, and the ICC is interpreted as the % of total variance accounted for by subject/item variance (ICC(1,1) in Shrout and Fleiss's notation).
  • Two-way random effects model: both factors, raters and items/subjects, are viewed as random effects, and we have two variance components (or mean squares) in addition to the residual variance; we further assume that raters assess all items/subjects; the ICC in this case is the % of total variance (subjects + raters + residual) attributable to subject/item variance, so systematic rater differences lower the coefficient.
  • Two-way mixed model: contrary to the one-way approach, here raters are considered as fixed effects (no generalization beyond the sample at hand) but items/subjects are treated as random effects; the unit of analysis may be the single rating or the average of the ratings.
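To make the three cases concrete, here is a sketch of the single-rating ICCs computed from the two-way ANOVA mean squares of a complete n-subjects x k-raters table, following Shrout and Fleiss (1979); the function name and the toy data are illustrative, not from the original answer:

```python
# Shrout & Fleiss (1979) single-rating ICCs from ANOVA mean squares.
# y is a complete table: n rows (subjects/items) x k columns (raters).

def icc_shrout_fleiss(y):
    n, k = len(y), len(y[0])
    m = sum(sum(row) for row in y) / (n * k)
    row_m = [sum(row) / k for row in y]
    col_m = [sum(y[i][j] for i in range(n)) / n for j in range(k)]

    ss_total = sum((y[i][j] - m) ** 2 for i in range(n) for j in range(k))
    ss_rows = k * sum((a - m) ** 2 for a in row_m)   # between subjects
    ss_cols = n * sum((b - m) ** 2 for b in col_m)   # between raters
    ss_err = ss_total - ss_rows - ss_cols            # residual
    ss_within = ss_total - ss_rows                   # one-way within-subject

    bms = ss_rows / (n - 1)                          # between-subject MS
    jms = ss_cols / (k - 1)                          # between-rater MS
    ems = ss_err / ((n - 1) * (k - 1))               # residual MS
    wms = ss_within / (n * (k - 1))                  # within-subject MS

    icc1 = (bms - wms) / (bms + (k - 1) * wms)                        # ICC(1,1)
    icc2 = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)  # ICC(2,1)
    icc3 = (bms - ems) / (bms + (k - 1) * ems)                        # ICC(3,1)
    return icc1, icc2, icc3

# Rater 2 is rater 1 shifted by a constant: consistency is perfect
# (ICC(3,1) = 1), but the two-way random ICC penalises the rater offset.
print(icc_shrout_fleiss([[1, 2], [2, 3], [3, 4]]))
```

Note how the rater mean square (JMS) enters only the two-way random formula: that is exactly the "raters as random effects" choice the SPSS dialog is asking about.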

I would say raters have to be entered as columns, although I'm not a specialist in SPSS. Dave Garson's dedicated website is worth looking at for those working with SPSS. There is also a complete on-line tutorial on reliability analysis by Robert A. Yaffee [archived version].
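A minimal sketch of that layout, assuming the raw data start in long format with one row per (rater, subject) pair; all names and values here are hypothetical:

```python
# Reshape "one row per rater" data into the subjects-in-rows,
# raters-in-columns table that reliability procedures expect.
from collections import defaultdict

# long format: (rater, subject, rating)
long_rows = [
    ("r1", "s1", 4), ("r1", "s2", 2),
    ("r2", "s1", 5), ("r2", "s2", 1),
    ("r3", "s1", 4), ("r3", "s2", 3),
]

by_subject = defaultdict(dict)
for rater, subject, rating in long_rows:
    by_subject[subject][rater] = rating

raters = sorted({r for r, _, _ in long_rows})
# one row per subject, one column per rater
wide = [[by_subject[s][r] for r in raters] for s in sorted(by_subject)]
print(raters)  # ['r1', 'r2', 'r3']
print(wide)    # [[4, 5, 4], [2, 1, 3]]
```

In SPSS itself the same effect is achieved with Data > Transpose (or the VARSTOCASES/CASESTOVARS restructuring commands), which is what "transposing the file so that each rater is a column" amounts to.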

For theoretical consideration about the mixed-effect approach, please consider reading my answer to this related question: Reliability in Elicitation Exercise.
