Solved – What to do in case of low inter-rater reliability (ICC)

agreement-statistics, reliability

Background: Eight doctors each rated the same 54 patients on a persuasiveness measure (1-7 Likert scale). The mean score on the persuasiveness measure will eventually be the outcome measure of my experiment.

Inter-rater reliability was quantified as the intraclass correlation coefficient (ICC), using the two-way random effects model with consistency. Unfortunately, the inter-rater reliability of the eight doctors was low (ICC = .350, single measures). Should I still run further planned analyses with these unreliable data? Or can it possibly be justified that I only include the doctors (i.e., raters) with the highest inter-rater reliability? I found out there are two doctors with a more acceptable inter-rater reliability (ICC = .718, N = 2), but I don't think this is enough reason to exclude the other doctors from analyses. I would really appreciate any references to literature that deals with this problem.
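For reference, this is essentially how the single-measures consistency ICC is obtained from a patients × raters matrix via the two-way ANOVA decomposition (McGraw & Wong, 1996). The sketch below uses random example data, not my actual ratings:

```python
import numpy as np

# Illustrative only: a random 54 x 8 matrix standing in for the real ratings
# (rows = patients, columns = doctors), values on the 1-7 scale.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(54, 8)).astype(float)

def icc_consistency_single(x):
    """ICC(C,1): two-way model, consistency definition, single measures,
    computed from a subjects x raters matrix."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-patient means
    col_means = x.mean(axis=0)   # per-doctor means

    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-patient
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-doctor
    ss_error = ss_total - ss_rows - ss_cols          # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

print(round(icc_consistency_single(ratings), 3))
```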

Best Answer

I'd rather answer on the grounds of the methodology itself than on how to "fix" the situation. In another context, I assisted in working on a ratings and classification system and found that inter-rater agreement was disappointingly low. Two paths were considered:

  1. Change how rating agreements were defined and identify those who seemed to "understand" the task, or
  2. Refine the definitions used, along with the guidance and examples provided to raters, so that they could more easily understand how to rate things.

In the first scenario, the whole methodology and its results could have been rendered a waste simply because the inter-rater reliability was low: it indicated that either the original definitions were poor or the raters were given poor instructions. Had I proceeded along that path, I was sure to run into problems.

In the second case, the agreement between raters was very good. Since they rated quite a lot of items, they could also give feedback when they thought the original definitions and guidance were inadequate. In the end, the methodology was very reproducible.

Based on that, I would not modify your set of raters yet, but return to the original definitions and guidance. Any tinkering after the ratings have been collected is a problem, though it can be useful as a quality check. There will sometimes be raters who do what they want, no matter the guidance given. With good statistical methods it is easy to identify them and weight their contributions appropriately.
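As an illustration of the kind of check I mean (a sketch that assumes the ratings sit in a patients × raters matrix, with made-up data), one simple diagnostic is to correlate each rater with the mean of the remaining raters; raters who track the consensus poorly stand out:

```python
import numpy as np

# Assumed layout: rows = patients, columns = raters; example data only.
rng = np.random.default_rng(1)
ratings = rng.integers(1, 8, size=(54, 8)).astype(float)

# Correlate each rater's scores with the mean of the other raters.
for j in range(ratings.shape[1]):
    others = np.delete(ratings, j, axis=1).mean(axis=1)
    r = np.corrcoef(ratings[:, j], others)[0, 1]
    print(f"rater {j + 1}: r with consensus of the others = {r:.2f}")
```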

Now, if I'm mistaken and you don't plan any further data collection, i.e. your data are already collected and the study is done, you could try PCA or something like it to get a sense of how the different doctors (or patients) cluster.
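A minimal sketch of that idea (again assuming a patients × doctors matrix; scikit-learn is just one convenient choice, and the data here are random placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

# Example data only: rows = patients, columns = doctors.
rng = np.random.default_rng(2)
ratings = rng.integers(1, 8, size=(54, 8)).astype(float)

# Treat each doctor as a variable: centre and scale columns, then project.
z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0)
pca = PCA(n_components=2).fit(z)

# Loadings show how each doctor relates to the first two components;
# doctors with similar loadings are rating in a similar way.
# (pca.transform(z) gives the patient scores if you want to look at
# clustering on that side instead.)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
for j, (l1, l2) in enumerate(loadings, start=1):
    print(f"doctor {j}: PC1 = {l1:+.2f}, PC2 = {l2:+.2f}")
```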

Were the patients exposed to all of the doctors at the same time (e.g. through a video recording), or sequentially, with a chance to modify their presentation at each interaction? If the latter, the issue could lie with the patients rather than the doctors.