You should use the point estimate of the reliability, not the lower bound or anything of that sort. I guess that by lb/up you mean the bounds of the 95% CI for the ICC (I don't have SPSS, so I cannot check myself)? It's unfortunate that we also talk of Cronbach's alpha as a "lower bound for reliability," since this might have confused you.
It should be noted that this formula is not restricted to the use of an estimate of the ICC; in fact, you can plug in any "valid" measure of reliability (most of the time, Cronbach's alpha is used). Apart from the NCME tutorial that I linked to in my comment, you might be interested in this recent article:
Tighe et al. The standard error of measurement is a more appropriate measure of quality for postgraduate medical assessments than is reliability: an analysis of MRCP(UK) examinations. BMC Medical Education 2010, 10:40.
Although it might seem to barely address your question at first sight, it has some additional material showing how to compute the SEM (here with Cronbach's $\alpha$, but it is straightforward to adapt to the ICC); and, anyway, it's always interesting to look around to see how people use the SEM.
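For reference, the usual formula, whichever reliability coefficient you decide to plug in, is $$\mathrm{SEM} = s_x\sqrt{1 - r_{xx}},$$ where $s_x$ is the standard deviation of the observed scores and $r_{xx}$ is the reliability estimate (ICC, Cronbach's $\alpha$, or any other suitable coefficient).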
I assume that A through D are different symptoms, say, and 1 and 2 are the two raters. As you tagged this in Stata, I will build a Stata example. Let us first simulate some data: we have a bunch of subjects with two uncorrelated traits, and a battery of questions, tapping upon these traits. The two raters have different sensitivities to each of the traits: the first rater is a tad more likely than the second rater to give a positive answer on question A, but slightly less likely to give a positive answer on question B, etc.
clear
set seed 10101
set obs 200
* generate orthogonal individual traits
generate trait1 = rnormal()
generate trait2 = rnormal()
* raters' intercepts for individual questions
local q1list 0.3 0.7 -0.2 -0.4
local q2list 0.5 0.5 0 -0.5
* prefixes
local letters a b c d
forvalues k = 1/4 {
    local thisletter : word `k' of `letters'
    local rater1 : word `k' of `q1list'
    local rater2 : word `k' of `q2list'
    generate byte `thisletter'1 = ( `k'/3*trait1 + (3-`k')/5*trait2 + 0.3*rnormal() > `rater1' )
    generate byte `thisletter'2 = ( `k'/3*trait1 + (3-`k')/5*trait2 + 0.3*rnormal() > `rater2' )
}
This should produce something like
. list a1-d2 in 1/5, noobs
+---------------------------------------+
| a1 a2 b1 b2 c1 c2 d1 d2 |
|---------------------------------------|
| 1 1 0 0 1 0 1 1 |
| 0 0 0 0 0 1 1 1 |
| 0 0 0 0 0 0 0 0 |
| 1 0 0 1 1 1 1 1 |
| 0 0 0 1 1 1 1 1 |
+---------------------------------------+
which I hope resembles your data, at least in terms of the existing variables.
A fully non-parametric summary of the inter-rater agreement can be constructed by converting the binary representation of the four answers into a decimal code: the outcome a1=0, b1=0, c1=0, d1=0 is 0000b = 0; the outcome in the first observation is 1011b = 11; etc. Let us produce this encoding:
generate int pattern1 = 0
generate int pattern2 = 0
forvalues k = 1/4 {
    local thisletter : word `k' of `letters'
    replace pattern1 = pattern1 + `thisletter'1 * 2^(4-`k')
    replace pattern2 = pattern2 + `thisletter'2 * 2^(4-`k')
}
tabulate pattern1 pattern2
This should produce something like
. list a1-d2 pat* in 1/5, noobs
+-------------------------------------------------------------+
| a1 a2 b1 b2 c1 c2 d1 d2 pattern1 pattern2 |
|-------------------------------------------------------------|
| 1 1 0 0 1 0 1 1 11 9 |
| 0 0 0 0 0 1 1 1 1 3 |
| 0 0 0 0 0 0 0 0 0 0 |
| 1 0 0 1 1 1 1 1 11 7 |
| 0 0 0 1 1 1 1 1 3 7 |
+-------------------------------------------------------------+
Now, these patterns are perfectly comparable using kap:
. kap pattern1 pattern2
Agreement Exp.Agrmt Kappa Std. Err. Z Prob>Z
-----------------------------------------------------------------
54.00% 17.91% 0.4396 0.0308 14.25 0.0000
You can play with the sample size or with the differences between raters to produce a non-significant answer :). This kappa suffers from a serious drawback, though: it gives no credit for partial agreement. The patterns 0001 and 0000, even though they match on three of the four items, are treated as complete non-matches within this approach. So it is an extremely conservative measure of inter-rater agreement.
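If you want to give some credit for partial agreement, a quick complement (just a sketch, reusing the simulated item variables from above) is to look at the item-wise kappas:

* item-by-item kappas, as a complement to the all-or-nothing pattern kappa
foreach l in a b c d {
    kap `l'1 `l'2
}

Each call compares the two raters on a single item, so a pair like 0001 vs. 0000 now counts as agreement on three items out of four rather than as a plain mismatch.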
To get fair estimates of all the ICCs, you would need to run a cross-classified mixed model. Let us first reshape the data to make this possible:
generate long id = _n
* reshape the raters
reshape long a b c d , i(id) j(rater 1 2)
* reshape the items
forvalues k = 1/4 {
    local thisletter : word `k' of `letters'
    rename `thisletter' q`k'
}
reshape long q , i(id rater) j(item 1 2 3 4)
Now, we can run xtmelogit (or gllamm, if you like it better) on these data:
. xtmelogit q || _all : R.rater || _all: R.item || _all: R.id, nolog
Note: factor variables specified; option laplace assumed
Mixed-effects logistic regression Number of obs = 1600
Group variable: _all Number of groups = 1
Obs per group:
min = 1600
avg = 1600.0
max = 1600
Integration points = 1 Wald chi2(0) = .
Log likelihood = -697.55526 Prob > chi2 = .
------------------------------------------------------------------------------
q | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | -.7795316 .9384147 -0.83 0.406 -2.618791 1.059727
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
_all: Identity |
sd(R.rater) | .1407056 .1627763 .0145745 1.358408
-----------------------------+------------------------------------------------
_all: Identity |
sd(R.item) | 1.797133 .6461083 .8882897 3.635847
-----------------------------+------------------------------------------------
_all: Identity |
sd(R.id) | 3.18933 .2673165 2.706171 3.758751
------------------------------------------------------------------------------
LR test vs. logistic regression: chi2(3) = 793.71 Prob > chi2 = 0.0000
Note: LR test is conservative and provided only for reference.
Note: log-likelihood calculations are based on the Laplacian approximation.
This is a cross-classified model with three random effects: subjects, raters and items, assumed to be uncorrelated (which is wrong for these data; see below). Let us now estimate the ICCs; note that on the latent logistic scale the residual variance is fixed at $\pi^2/3 \approx 3.29$, which is why that term appears in the denominators below:
. local Vrater ( exp(2*_b[lns1_1_1:_cons]) )
. local Vitem ( exp(2*_b[lns1_2_1:_cons]) )
. local Vid ( exp(2*_b[lns1_3_1:_cons]) )
. nlcom `Vrater' / (`Vrater' + `Vitem' + `Vid' + _pi*_pi/3 )
-----------------------------------------------------------------------
q | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------+----------------------------------------------------------------
_nl_1 | .0011847 .0027384 0.43 0.665 -.0041824 .0065519
-----------------------------------------------------------------------
. nlcom `Vid' / (`Vrater' + `Vitem' + `Vid' + _pi*_pi/3 )
------------------------------------------------------------------------
q | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------+----------------------------------------------------------------
_nl_1 | .6086839 .0903816 6.73 0.000 .4315393 .7858285
------------------------------------------------------------------------
. nlcom `Vitem' / (`Vrater' + `Vitem' + `Vid' + _pi*_pi/3 )
-----------------------------------------------------------------------
q | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------+----------------------------------------------------------------
_nl_1 | .193265 .1121376 1.72 0.085 -.0265206 .4130506
-----------------------------------------------------------------------
(Hint: I figured out the names of the parameters with matrix list e(b).)
These are the ICCs corresponding to raters, subjects and items, respectively. The near-zero ICC for the raters actually makes sense given how the data were generated: there is no systematic effect whereby one rater consistently rates the condition higher or lower than the other. There is an interaction between rater and item, but this model does not reflect it. A specification truer to the generating process would be something like
xtmelogit q ibn.item##ibn.rater, nocons || id:
With this specification, you would have to get ICCs by an even more complicated mix of the variance components and the point estimates from the fixed effects part of the model.
If you have the patience (or a powerful computer), you can specify intp(7) or something like that to get an approximation more accurate than the Laplace approximation (which uses a single point at the mode of the distribution of the random effects).
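For instance (a sketch; expect a considerably longer run time):

* refit with 7-point adaptive quadrature instead of the Laplace approximation
xtmelogit q || _all: R.rater || _all: R.item || _all: R.id, intpoints(7) nolog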
Best Answer
As Jeremy pointed out, there are multiple versions of the ICC, reflecting distinct ways of accounting for rater or item variance in the overall variance. There's a nice summary of the use of kappa and ICC indices for rater reliability in Computing Inter-Rater Reliability for Observational Data: An Overview and Tutorial, by Kevin A. Hallgren, and I discussed the different versions of the ICC in a related post. Briefly, you have to decide whether your raters are to be considered as sampled from a larger pool of potential raters or as a fixed set of raters. In the first case, raters are treated as random effects; in the second, as fixed effects. Likewise, items may be treated as either fixed or random units. Usually, we use a two-way random-effects model (both raters and items treated as random effects) to estimate relative agreement, or the ICC(2,k) version for absolute reliability if you care about systematic error between raters. The SEM can be calculated as the square root of the mean square error from a one-way ANOVA, or from the sample standard deviation, as suggested in the other reply.
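As a minimal Stata sketch (reusing the long-format simulated data from the first part, with q the rating and id the subject; the reliability value below is only a placeholder for your own estimate):

* SEM as the square root of the mean square error from a one-way ANOVA
anova q id
display "SEM = " e(rmse)
* equivalently, from the observed SD and a reliability estimate
local rel = 0.61            // placeholder: plug in your own ICC or alpha
quietly summarize q
display "SEM = " r(sd)*sqrt(1 - `rel')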
Note that the choice of the SD matters less than the choice of the ICC, since ICC estimates can differ a lot depending on your sample size and the inherent variation of the ratings. Here are two examples of results obtained from the same dataset using the ICC2 (top) and ICC3 (bottom) approaches: [figure omitted]
In R, there are many packages available, including psych (see the ICC() function) and irr. I provide some examples of use of the former in my other answer, and I have some examples of the use of irr in a separate blog post on my site. Using Stata, the icc command provides all three ANOVA models and the associated estimation of the ICC, as shown above.
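For example, with the long-format simulated data from above (a sketch only: icc expects a quantitative rating, so applying it to the binary q amounts to the usual linear ANOVA approximation):

* two-way random-effects ICC (raters viewed as sampled from a larger pool)
icc q id rater
* two-way mixed-effects variant (raters fixed), consistency of agreement
icc q id rater, mixed consistency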