Inter-Rater Reliability – Methods for Ordinal and Interval Data

agreement-statisticscohens-kappapsychometricsreliability

Which inter-rater reliability methods are most appropriate for ordinal or interval data?

I believe that "Joint probability of agreement" or "Kappa" are designed for nominal data. Whilst "Pearson" and "Spearman" can be used, they are mainly used for two raters (although they can be used for more than two raters).

What other measures are suitable for ordinal or interval data, i.e. more than two raters?

Best Answer

The Kappa ($\kappa$) statistic is a quality index that compares observed agreement between 2 raters on a nominal or ordinal scale with agreement expected by chance alone (as if raters were tossing up). Extensions for the case of multiple raters exist (2, pp. 284–291). In the case of ordinal data, you can use the weighted $\kappa$, which basically reads as usual $\kappa$ with off-diagonal elements contributing to the measure of agreement. Fleiss (3) provided guidelines to interpret $\kappa$ values but these are merely rules of thumbs.

The $\kappa$ statistic is asymptotically equivalent to the ICC estimated from a two-way random effects ANOVA, but significance tests and SE coming from the usual ANOVA framework are not valid anymore with binary data. It is better to use bootstrap to get confidence interval (CI). Fleiss (8) discussed the connection between weighted kappa and the intraclass correlation (ICC).

It should be noted that some psychometricians don't very much like $\kappa$ because it is affected by the prevalence of the object of measurement much like predictive values are affected by the prevalence of the disease under consideration, and this can lead to paradoxical results.

Inter-rater reliability for $k$ raters can be estimated with Kendall’s coefficient of concordance, $W$. When the number of items or units that are rated $n > 7$, $k(n − 1)W \sim \chi^2(n − 1)$. (2, pp. 269–270). This asymptotic approximation is valid for moderate value of $n$ and $k$ (6), but with less than 20 items $F$ or permutation tests are more suitable (7). There is a close relationship between Spearman’s $\rho$ and Kendall’s $W$ statistic: $W$ can be directly calculated from the mean of the pairwise Spearman correlations (for untied observations only).

Polychoric (ordinal data) correlation may also be used as a measure of inter-rater agreement. Indeed, they allow to

estimate what would be the correlation if ratings were made on a continuous scale,
test marginal homogeneity between raters.

In fact, it can be shown that it is a special case of latent trait modeling, which allows to relax distributional assumptions (4).

About continuous (or so assumed) measurements, the ICC which quantifies the proportion of variance attributable to the between-subject variation is fine. Again, bootstraped CIs are recommended. As @ars said, there are basically two versions -- agreement and consistency -- that are applicable in the case of agreement studies (5), and that mainly differ on the way sum of squares are computed; the “consistency” ICC is generally estimated without considering the Item×Rater interaction. The ANOVA framework is useful with specific block design where one wants to minimize the number of ratings (BIBD) -- in fact, this was one of the original motivation of Fleiss's work. It is also the best way to go for multiple raters. The natural extension of this approach is called the Generalizability Theory. A brief overview is given in Rater Models: An Introduction, otherwise the standard reference is Brennan's book, reviewed in Psychometrika 2006 71(3).

As for general references, I recommend chapter 3 of Statistics in Psychiatry, from Graham Dunn (Hodder Arnold, 2000). For a more complete treatment of reliability studies, the best reference to date is

Dunn, G (2004). Design and Analysis of Reliability Studies. Arnold. See the review in the International Journal of Epidemiology.

A good online introduction is available on John Uebersax's website, Intraclass Correlation and Related Methods; it includes a discussion of the pros and cons of the ICC approach, especially with respect to ordinal scales.

Relevant R packages for two-way assessment (ordinal or continuous measurements) are found in the Psychometrics Task View; I generally use either the psy, psych, or irr packages. There's also the concord package but I never used it. For dealing with more than two raters, the lme4 package is the way to go for it allows to easily incorporate random effects, but most of the reliability designs can be analysed using the aov() because we only need to estimate variance components.

References

J Cohen. Weighted kappa: Nominal scale agreement with provision for scales disagreement of partial credit. Psychological Bulletin, 70, 213–220, 1968.
S Siegel and Jr N John Castellan. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, Second edition, 1988.
J L Fleiss. Statistical Methods for Rates and Proportions. New York: Wiley, Second edition, 1981.
J S Uebersax. The tetrachoric and polychoric correlation coefficients. Statistical Methods for Rater Agreement web site, 2006. Available at: http://john-uebersax.com/stat/tetra.htm. Accessed February 24, 2010.
P E Shrout and J L Fleiss. Intraclass correlation: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428, 1979.
M G Kendall and B Babington Smith. The problem of m rankings. Annals of Mathematical Statistics, 10, 275–287, 1939.
P Legendre. Coefficient of concordance. In N J Salkind, editor, Encyclopedia of Research Design. SAGE Publications, 2010.
J L Fleiss. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613-619, 1973.

Related Solutions

Solved – Intra- and inter-rater reliability on the same data

I assume that A through D are different symptoms, say, and 1 and 2 are the two raters. As you tagged this in Stata, I will build a Stata example. Let us first simulate some data: we have a bunch of subjects with two uncorrelated traits, and a battery of questions, tapping upon these traits. The two raters have different sensitivities to each of the traits: the first rater is a tad more likely than the second rater to give a positive answer on question A, but slightly less likely to give a positive answer on question B, etc.

    clear
    set seed 10101
    set obs 200

    * generate orthogonal individual traits
    generate trait1 = rnormal()
    generate trait2 = rnormal()

    * raters' interecepts for individual questions
    local q1list 0.3 0.7 -0.2 -0.4
    local q2list 0.5 0.5 0 -0.5

    * prefixes
    local letters a b c d

    forvalues k = 1/4 {
      local thisletter : word `k' of `letters'
      local rater1     : word `k' of `q1list'
      local rater2     : word `k' of `q2list'
      generate byte `thisletter'1 = ( `k'/3*trait1 + (3-`k')/5*trait2 + 0.3*rnormal() > `rater1' ) 
      generate byte `thisletter'2 = ( `k'/3*trait1 + (3-`k')/5*trait2 + 0.3*rnormal() > `rater2' ) 
    }

This should produce something like

      . list a1-d2 in 1/5, noobs

       +---------------------------------------+
      | a1   a2   b1   b2   c1   c2   d1   d2 |
      |---------------------------------------|
      |  1    1    0    0    1    0    1    1 |
      |  0    0    0    0    0    1    1    1 |
      |  0    0    0    0    0    0    0    0 |
      |  1    0    0    1    1    1    1    1 |
      |  0    0    0    1    1    1    1    1 |
      +---------------------------------------+

which I hope resembles your data, at least in terms of the existing variables.

A fully non-parametric summary of the inter-rater agreement can be constructed by converting the binary representation into a decimal representation. The outcome a1=0, b1=0, c1=0, c4=0 is 0000b=0; the outcome in the first observation is 1011b = 11, etc. Let us produce this encoding:

      generate int pattern1 = 0
      generate int pattern2 = 0
      forvalues k = 1/4 {
        local thisletter : word `k' of `letters'
        replace pattern1 = pattern1 + `thisletter'1 * 2^(4-`k')
        replace pattern2 = pattern2 + `thisletter'2 * 2^(4-`k')
        tab pattern*
      }

This should produce something like

      . list  a1- d2 pat* in 1/5, noobs

        +-------------------------------------------------------------+
        | a1   a2   b1   b2   c1   c2   d1   d2   pattern1   pattern2 |
        |-------------------------------------------------------------|
        |  1    1    0    0    1    0    1    1         11          9 |
        |  0    0    0    0    0    1    1    1          1          3 |
        |  0    0    0    0    0    0    0    0          0          0 |
        |  1    0    0    1    1    1    1    1         11          7 |
        |  0    0    0    1    1    1    1    1          3          7 |
        +-------------------------------------------------------------+

Now, these patterns are perfectly comparable using kap:

        . kap pattern1 pattern2


        Agreement   Exp.Agrmt    Kappa     Std. Err.      Z        Prob>Z
        -----------------------------------------------------------------
        54.00%        17.91%     0.4396     0.0308      14.25      0.0000

You can play with the sample size or with the differences between raters to produce a non-significant answer :). This kappa suffers from a serious drawback: it does not reflect the fact of having some common items: the patterns 0001 and 0000, even though they match by 75%, would be considered non-matches within this approach. So it is an extremely conservative measure of the inter-rater agreement.

To get fair estimates of all the ICCs, you would need to run a cross-classified mixed model. Let us first reshape the data to make it possible:

        generate long id = _n

        * reshape the raters
        reshape long a b c d , i(id) j(rater 1 2)

        * reshape the items
        forvalues k = 1/4 {
          local thisletter : word `k' of `letters'
          rename `thisletter' q`k'
        }
        reshape long q , i(id rater) j(item 1 2 3 4)

Now, we can run xtmelogit (or gllamm if you like it better) on this data:

        . xtmelogit q || _all : R.rater || _all: R.item || _all: R.id, nolog

        Note: factor variables specified; option laplace assumed

        Mixed-effects logistic regression               Number of obs      =      1600
        Group variable: _all                            Number of groups   =         1

                                                        Obs per group:
        min =      1600
        avg =    1600.0
        max =      1600

        Integration points =   1                        Wald chi2(0)       =         .
        Log likelihood = -697.55526                     Prob > chi2        =         .

        ------------------------------------------------------------------------------
        q            |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        _cons        |  -.7795316   .9384147    -0.83   0.406    -2.618791    1.059727
        ------------------------------------------------------------------------------

        ------------------------------------------------------------------------------
        Random-effects Parameters    |   Estimate   Std. Err.     [95% Conf. Interval]
        -----------------------------+------------------------------------------------
        _all: Identity               |
        sd(R.rater)                  |   .1407056   .1627763      .0145745    1.358408
        -----------------------------+------------------------------------------------
        _all: Identity               |
        sd(R.item)                   |   1.797133   .6461083      .8882897    3.635847
        -----------------------------+------------------------------------------------
        _all: Identity               |
        sd(R.id)                     |    3.18933   .2673165      2.706171    3.758751
        ------------------------------------------------------------------------------
        LR test vs. logistic regression:     chi2(3) =   793.71   Prob > chi2 = 0.0000

        Note: LR test is conservative and provided only for reference.
        Note: log-likelihood calculations are based on the Laplacian approximation.

This is a cross-classified model with three random effects: subjects, raters and items, assuming that they are uncorrelated (which is wrong for this data; see below). Let us now estimate the ICCs:

        . local Vrater ( exp(2*_b[lns1_1_1:_cons]) )
        . local Vitem ( exp(2*_b[lns1_2_1:_cons]) )
        . local Vid ( exp(2*_b[lns1_3_1:_cons]) )
        . nlcom `Vrater' / (`Vrater' + `Vitem' + `Vid' + _pi*_pi/3 )

        -----------------------------------------------------------------------
        q     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ------+----------------------------------------------------------------
        _nl_1 |   .0011847   .0027384     0.43   0.665    -.0041824    .0065519
        -----------------------------------------------------------------------

        . nlcom `Vid' / (`Vrater' + `Vitem' + `Vid' + _pi*_pi/3 )

        ------------------------------------------------------------------------
        q     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ------+----------------------------------------------------------------
        _nl_1 |   .6086839   .0903816     6.73   0.000     .4315393    .7858285
        ------------------------------------------------------------------------

        . nlcom `Vitem' / (`Vrater' + `Vitem' + `Vid' + _pi*_pi/3 )

        -----------------------------------------------------------------------
        q     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ------+----------------------------------------------------------------
        _nl_1 |    .193265   .1121376     1.72   0.085    -.0265206    .4130506
        -----------------------------------------------------------------------

(Hint: I figured out the names of the parameters by matrix list e(b).)

These are ICCs corresponding to raters, subjects and items, respectively. The zero ICC of the raters actually makes sense in the context of how the data were generated: there is no systematic effect in the sense that one rater consistently rates the condition better or worse than the other rater. There is an interaction between rater and item, but the model does not reflect it. True to life would be something like

 xtmelogit q ibn.item##ibn.rater, nocons || id:

With this specification, you would have to get ICCs by an even more complicated mix of the variance components and the point estimates from the fixed effects part of the model.

If you have the patience (or a powerful computer), you can specify intp(7) or something like that to get an approximation more accurate than the Laplace approximation (a single point at the mode of the distribution of the random effects).

Solved – Inter-rater reliability with many non-overlapping raters

Check out Krippendorff's alpha. It has several advantages over some other measures such as Cohen's Kappa, Fleiss's Kappa, Cronbach's alpha: it is robust to missing data (which I gather is the main concern you have); it is capable of dealing with more than 2 raters; and it can handle different types of scales ( nominal, ordinal, etc.), and it also accounts for chance agreements better than some other measures like Cohen's Kappa.

Calculation of Krippendorff's alpha is supported by several statistical software packages, including R (by the irr package), SPSS, etc.

Below are some relevant papers, that discuss Krippendorff's alpha including its properties and its implementation, and compare it with other measures:

Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77-89.
Krippendorff, K. (2004). Reliability in Content Analysis: Some Common Misconceptions and Recommendations. Human Communication Research, 30(3), 411-433. doi: 10.1111/j.1468-2958.2004.tb00738.x
Chapter 3 in Krippendorff, K. (2013). Content Analysis: An Introduction to Its Methodology (3rd ed.): Sage.

There are some additional technical papers in Krippendorff's website

Best Answer

Related Solutions

Solved – Intra- and inter-rater reliability on the same data

Solved – Inter-rater reliability with many non-overlapping raters

Related Question