Solved – Interrater reliability for events in a time series with uncertainty about event time

agreement-statistics, reliability, time-series

I have multiple independent coders who are trying to identify events in a time series; in this case, they watch video of face-to-face conversation, look for particular nonverbal behaviors (e.g., head nods), and code the time and category of each event. This data could reasonably be treated as a discrete-time series with a high sampling rate (30 frames/second) or as a continuous-time series, whichever is easier to work with.

I'd like to compute some measure of inter-rater reliability, but I expect there to be some uncertainty in when events occurred; that is, I expect that one coder might, for example, code that a particular movement began a quarter second later than other coders thought it started. These are rare events, if that helps; there are typically at least several seconds (hundreds of video frames) between events.

Is there a good way of assessing inter-rater reliability that looks at both of these kinds of agreement and disagreement: (1) do raters agree on what event occurred (if any), and (2) do they agree on when it occurred? The second is important to me because I'm interested in looking at the timing of these events relative to other things happening in the conversation, like what people are saying.

Standard practice in my field seems to be to divide things up into time slices, say 1/4 of a second or so, aggregate the events each coder reported per time slice, and then compute Cohen's kappa or some similar measure. But the choice of slice duration is ad hoc, and it gives me little sense of the uncertainty in the timing of events.
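For concreteness, here is a minimal sketch of that time-slicing approach, assuming each coder's output is a list of (time in seconds, label) pairs; the helper name, the toy codings, and the 0.25 s slice width are all illustrative assumptions, not a prescription.

```python
# Minimal sketch of the time-slicing approach, assuming each coder's output
# is a list of (time in seconds, label) tuples. The codings below and the
# 0.25 s slice width are purely illustrative.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def to_slices(events, duration, slice_len=0.25):
    """Map a coder's events onto fixed-width time slices.

    Each slice gets the label of an event falling inside it, or "none" if
    the slice contains no event.
    """
    n_slices = int(np.ceil(duration / slice_len))
    labels = ["none"] * n_slices
    for t, label in events:
        labels[min(int(t // slice_len), n_slices - 1)] = label
    return labels

# Hypothetical codings of the same 10-second clip by two coders.
coder_a = [(1.2, "head nod"), (4.6, "head shake"), (8.0, "head nod")]
coder_b = [(1.4, "head nod"), (4.7, "head shake"), (8.1, "head nod")]

slices_a = to_slices(coder_a, duration=10.0)
slices_b = to_slices(coder_b, duration=10.0)
print(cohen_kappa_score(slices_a, slices_b))
```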

The best thought I have so far is that I could compute some kind of reliability curve: something like kappa as a function of the size of the window within which I consider two events to have been coded at the same time. I'm not really sure where to go from there, though…
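One way to get such a curve is simply to sweep the slice width and recompute kappa at each setting; the snippet below continues the previous sketch (it reuses the hypothetical to_slices helper and codings), and the widths tried are arbitrary examples.

```python
# Reliability curve: kappa as a function of slice width, continuing the
# previous sketch (reuses to_slices, coder_a, and coder_b defined there).
for slice_len in (1 / 30, 0.1, 0.25, 0.5, 1.0, 2.0):
    k = cohen_kappa_score(
        to_slices(coder_a, duration=10.0, slice_len=slice_len),
        to_slices(coder_b, duration=10.0, slice_len=slice_len),
    )
    print(f"slice = {slice_len:5.3f} s   kappa = {k:.2f}")
```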

Best Answer

Here are a couple of ways to think about it.

1)

A) You could treat each full sequence of codings as an ordered set of events (e.g., ["head nod", "head shake", "head nod", "eyebrow raised"] and ["head nod", "head shake", "eyebrow raised"]), then align the sequences using an algorithm that makes sense to you ( http://en.wikipedia.org/wiki/Sequence_alignment ). You could then compute inter-coder reliability for the entire sequence.

B) Then, again using the aligned sequences, you could compare when each coder said an event happened, given that both coders observed the event (a rough sketch of both steps follows below).
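Here is a rough, self-contained sketch of both steps, using Needleman-Wunsch as one concrete choice of alignment algorithm; the scoring scheme, the (onset, label) data format, and the toy codings are all assumptions for illustration.

```python
# Rough sketch of steps A and B: align two coders' label sequences with
# Needleman-Wunsch, then compare onset times for events both coders saw.
import numpy as np

def align(labels_a, labels_b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment of two label sequences.

    Returns a list of (i, j) index pairs into the two sequences; None marks
    a gap, i.e. an event that one coder reported and the other did not.
    """
    n, m = len(labels_a), len(labels_b)
    score = np.zeros((n + 1, m + 1))
    score[:, 0] = gap * np.arange(n + 1)
    score[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if labels_a[i - 1] == labels_b[j - 1] else mismatch
            score[i, j] = max(score[i - 1, j - 1] + s,
                              score[i - 1, j] + gap,
                              score[i, j - 1] + gap)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if labels_a[i - 1] == labels_b[j - 1] else mismatch
            if score[i, j] == score[i - 1, j - 1] + s:
                pairs.append((i - 1, j - 1)); i -= 1; j -= 1
                continue
        if i > 0 and score[i, j] == score[i - 1, j] + gap:
            pairs.append((i - 1, None)); i -= 1
        else:
            pairs.append((None, j - 1)); j -= 1
    return pairs[::-1]

# Hypothetical codings: (onset in seconds, label) for each coder.
coder_a = [(1.2, "head nod"), (4.6, "head shake"), (6.1, "head nod"), (8.0, "eyebrow raise")]
coder_b = [(1.4, "head nod"), (4.7, "head shake"), (8.3, "eyebrow raise")]

pairs = align([e[1] for e in coder_a], [e[1] for e in coder_b])

# (A) Agreement on *what* happened: same label at an aligned position,
#     with gaps (events only one coder reported) counted as disagreements.
agree = [i is not None and j is not None and coder_a[i][1] == coder_b[j][1]
         for i, j in pairs]
print("agreement over aligned sequence:", np.mean(agree))

# (B) Agreement on *when* it happened, for events both coders observed.
offsets = [coder_b[j][0] - coder_a[i][0] for i, j in pairs
           if i is not None and j is not None and coder_a[i][1] == coder_b[j][1]]
print("onset differences (s):", offsets)
```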

2) Alternatively, you could model this as a Hidden Markov Model and use something like the Baum-Welch algorithm to estimate the probabilities that, given some actual event, each coder actually coded the data correctly. http://en.wikipedia.org/wiki/Baum-Welch_algorithm
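As a very rough sketch of what that could look like, the snippet below uses hmmlearn's CategoricalHMM (older hmmlearn releases expose the same categorical-emission model under the name MultinomialHMM), whose .fit() runs Baum-Welch/EM. Encoding the joint (coder A, coder B) code in each time slice as a single categorical symbol is only one possible setup, and all of the data below are made up.

```python
# Very rough sketch of the HMM idea: hidden states stand in for the "true"
# events, observations are the coders' per-slice codes, and Baum-Welch (EM,
# run by .fit()) estimates how each state tends to be coded.
import numpy as np
from hmmlearn.hmm import CategoricalHMM

# Per-slice codes for two coders over the same stretch of video:
# 0 = nothing, 1 = head nod, 2 = head shake (hypothetical data).
coder_a = np.array([0, 0, 1, 1, 0, 0, 2, 2, 0, 0, 1, 1, 0, 0])
coder_b = np.array([0, 0, 0, 1, 1, 0, 2, 2, 0, 0, 0, 1, 1, 0])

# Encode the pair of codes in each slice as a single categorical symbol.
n_codes = 3
symbols = (coder_a * n_codes + coder_b).reshape(-1, 1)

# One hidden state per putative "true" condition (none / nod / shake).
model = CategoricalHMM(n_components=3, n_iter=200, random_state=0)
model.fit(symbols)

# Rows: hidden states; columns: joint (coder A, coder B) symbols. How each
# row spreads its mass over the symbols describes how the coders tend to
# respond when the model believes a given true state is active.
print(model.emissionprob_.round(2))

# Most likely sequence of "true" states for this clip.
print(model.predict(symbols))
```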
