Solved – Interrater reliability for events in a time series with uncertainty about event time

agreement-statistics, reliability, time-series

I have multiple independent coders who are trying to identify events in a time series; in this case, they watch video of face-to-face conversation, look for particular nonverbal behaviors (e.g., head nods), and code the time and category of each event. This data could reasonably be treated as a discrete-time series with a high sampling rate (30 frames/second) or as a continuous-time series, whichever is easier to work with.

I'd like to compute some measure of inter-rater reliability, but I expect there to be some uncertainty in when events occurred; that is, I expect that one coder might, for example, code that a particular movement began a quarter second later than other coders thought it started. These are rare events, if that helps; there are typically at least several seconds (hundreds of video frames) between events.

Is there a good way of assessing inter-rater reliability that looks at both of these kinds of agreement and disagreement: (1) do raters agree on what event occurred (if any), and (2) do they agree on when it occurred? The second is important to me because I'm interested in looking at the timing of these events relative to other things happening in the conversation, like what people are saying.

Standard practice in my field seems to be to divide things up into time slices, say 1/4 of a second or so, aggregate the events each coder reported per time slice, and then compute Cohen's kappa or some similar measure. But the choice of slice duration is ad hoc, and it gives me little sense of the uncertainty in the timing of events.
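For concreteness, here is a minimal sketch of that time-slicing approach, assuming each coder's output is a list of (time in seconds, label) pairs; the helper name, the toy codings, and the 0.25 s slice width are all illustrative assumptions, not a prescription.

```python
# Minimal sketch of the time-slicing approach, assuming each coder's output
# is a list of (time in seconds, label) tuples. The codings below and the
# 0.25 s slice width are purely illustrative.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def to_slices(events, duration, slice_len=0.25):
    """Map a coder's events onto fixed-width time slices.

    Each slice gets the label of an event falling inside it, or "none" if
    the slice contains no event.
    """
    n_slices = int(np.ceil(duration / slice_len))
    labels = ["none"] * n_slices
    for t, label in events:
        labels[min(int(t // slice_len), n_slices - 1)] = label
    return labels

# Hypothetical codings of the same 10-second clip by two coders.
coder_a = [(1.2, "head nod"), (4.6, "head shake"), (8.0, "head nod")]
coder_b = [(1.4, "head nod"), (4.7, "head shake"), (8.1, "head nod")]

slices_a = to_slices(coder_a, duration=10.0)
slices_b = to_slices(coder_b, duration=10.0)
print(cohen_kappa_score(slices_a, slices_b))
```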

The best thought I have so far is that I could compute some kind of reliability curve: something like kappa as a function of the size of the window within which I consider two events to have been coded at the same time. I'm not really sure where to go from there, though…
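One way to get such a curve is simply to sweep the slice width and recompute kappa at each setting; the snippet below continues the previous sketch (it reuses the hypothetical to_slices helper and codings), and the widths tried are arbitrary examples.

```python
# Reliability curve: kappa as a function of slice width, continuing the
# previous sketch (reuses to_slices, coder_a, and coder_b defined there).
for slice_len in (1 / 30, 0.1, 0.25, 0.5, 1.0, 2.0):
    k = cohen_kappa_score(
        to_slices(coder_a, duration=10.0, slice_len=slice_len),
        to_slices(coder_b, duration=10.0, slice_len=slice_len),
    )
    print(f"slice = {slice_len:5.3f} s   kappa = {k:.2f}")
```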

Best Answer

Here are a couple of ways to think about it.

1)

A) You could treat each full sequence of codings as an ordered set of events (e.g., ["head nod", "head shake", "head nod", "eyebrow raised"] and ["head nod", "head shake", "eyebrow raised"]), then align the sequences using an algorithm that makes sense to you ( http://en.wikipedia.org/wiki/Sequence_alignment ). You could then compute inter-coder reliability for the entire sequence.

B) Then, again using the aligned sequences, you could compare when each coder said an event happened, given that both coders observed the event (a rough sketch of both steps follows below).
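Here is a rough, self-contained sketch of both steps, using Needleman-Wunsch as one concrete choice of alignment algorithm; the scoring scheme, the (onset, label) data format, and the toy codings are all assumptions for illustration.

```python
# Rough sketch of steps A and B: align two coders' label sequences with
# Needleman-Wunsch, then compare onset times for events both coders saw.
import numpy as np

def align(labels_a, labels_b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment of two label sequences.

    Returns a list of (i, j) index pairs into the two sequences; None marks
    a gap, i.e. an event that one coder reported and the other did not.
    """
    n, m = len(labels_a), len(labels_b)
    score = np.zeros((n + 1, m + 1))
    score[:, 0] = gap * np.arange(n + 1)
    score[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if labels_a[i - 1] == labels_b[j - 1] else mismatch
            score[i, j] = max(score[i - 1, j - 1] + s,
                              score[i - 1, j] + gap,
                              score[i, j - 1] + gap)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if labels_a[i - 1] == labels_b[j - 1] else mismatch
            if score[i, j] == score[i - 1, j - 1] + s:
                pairs.append((i - 1, j - 1)); i -= 1; j -= 1
                continue
        if i > 0 and score[i, j] == score[i - 1, j] + gap:
            pairs.append((i - 1, None)); i -= 1
        else:
            pairs.append((None, j - 1)); j -= 1
    return pairs[::-1]

# Hypothetical codings: (onset in seconds, label) for each coder.
coder_a = [(1.2, "head nod"), (4.6, "head shake"), (6.1, "head nod"), (8.0, "eyebrow raise")]
coder_b = [(1.4, "head nod"), (4.7, "head shake"), (8.3, "eyebrow raise")]

pairs = align([e[1] for e in coder_a], [e[1] for e in coder_b])

# (A) Agreement on *what* happened: same label at an aligned position,
#     with gaps (events only one coder reported) counted as disagreements.
agree = [i is not None and j is not None and coder_a[i][1] == coder_b[j][1]
         for i, j in pairs]
print("agreement over aligned sequence:", np.mean(agree))

# (B) Agreement on *when* it happened, for events both coders observed.
offsets = [coder_b[j][0] - coder_a[i][0] for i, j in pairs
           if i is not None and j is not None and coder_a[i][1] == coder_b[j][1]]
print("onset differences (s):", offsets)
```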

2) Alternatively, you could model this as a Hidden Markov Model and use something like the Baum-Welch algorithm to estimate the probabilities that, given some actual event, each coder actually coded the data correctly. http://en.wikipedia.org/wiki/Baum-Welch_algorithm
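As a very rough sketch of what that could look like, the snippet below uses hmmlearn's CategoricalHMM (older hmmlearn releases expose the same categorical-emission model under the name MultinomialHMM), whose .fit() runs Baum-Welch/EM. Encoding the joint (coder A, coder B) code in each time slice as a single categorical symbol is only one possible setup, and all of the data below are made up.

```python
# Very rough sketch of the HMM idea: hidden states stand in for the "true"
# events, observations are the coders' per-slice codes, and Baum-Welch (EM,
# run by .fit()) estimates how each state tends to be coded.
import numpy as np
from hmmlearn.hmm import CategoricalHMM

# Per-slice codes for two coders over the same stretch of video:
# 0 = nothing, 1 = head nod, 2 = head shake (hypothetical data).
coder_a = np.array([0, 0, 1, 1, 0, 0, 2, 2, 0, 0, 1, 1, 0, 0])
coder_b = np.array([0, 0, 0, 1, 1, 0, 2, 2, 0, 0, 0, 1, 1, 0])

# Encode the pair of codes in each slice as a single categorical symbol.
n_codes = 3
symbols = (coder_a * n_codes + coder_b).reshape(-1, 1)

# One hidden state per putative "true" condition (none / nod / shake).
model = CategoricalHMM(n_components=3, n_iter=200, random_state=0)
model.fit(symbols)

# Rows: hidden states; columns: joint (coder A, coder B) symbols. How each
# row spreads its mass over the symbols describes how the coders tend to
# respond when the model believes a given true state is active.
print(model.emissionprob_.round(2))

# Most likely sequence of "true" states for this clip.
print(model.predict(symbols))
```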
