Solved – Which measure for inter-rater agreement for continuous data of 2 raters about multiple subjects in multiple situations

agreement-statistics

I've considered measures like Cohen's kappa (but my data is continuous), the intraclass correlation (reliability rather than agreement), and plain correlation (which will be high even when one rater always rates consistently higher than the other)… but none of them seem to capture what I want.

I need a measure that shows the % agreement between two raters who have rated videos of multiple subjects in multiple situations, across 8 different experiments.

So, for example, in experiment 1 rater 1 rated 12 subjects in 3 situations and rater 2 did the same; in experiment 2 rater 1 rated 14 subjects in 4 situations and rater 2 did the same; and so on. Which measure would be adequate to test how much the raters agree over all 8 experiments? Or per experiment, so that I can calculate an average of the 8 values?

I'm sorry if this is a redundant question, but I've never been trained in this kind of statistics and I get completely lost searching the internet for an answer. I need to be sure that the measure is appropriate for my data, as I need to use it in my thesis.

Experiment 1, situation 1:

        Rater 1    Rater 2
Subj1     13         12
Subj2     85         71
Subj3     67         45
...       ...        ...

Experiment 1, situation 2:

        Rater 1    Rater 2
Subj1     76         90
Subj2     62         51
Subj3     51         51
...       ...        ...

The data look like this: in each experiment, a number of subjects are rated from 0 to 100 (= % freezing) in 3 (or 4) different situations by 2 raters (always the same two raters). You can find an example Excel file here (it might make things clearer): dropbox link

The thing is, if rater 1 says 76 and the other says 77, that's good: it's normal that the values won't be exactly the same, but if they are close, it means the raters more or less agreed. So the data are continuous, the values go from 0 to 100, and they are not categories of any kind. If the two raters' values are not exactly the same, that does not necessarily mean there is no agreement: the further apart the values are, the less the raters agreed; the closer they are, the more they agreed.
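Just to make the idea of "closer values = more agreement" concrete (this is only an illustration of what I mean, not a measure I am proposing), here are the per-subject absolute differences for the numbers from experiment 1, situation 1 above:

    # Only illustrating what I mean by "closer values = more agreement";
    # the numbers are the ones from experiment 1, situation 1 above.
    rater1 = [13, 85, 67]
    rater2 = [12, 71, 45]

    # Absolute difference per subject: a small difference (13 vs 12) reads as
    # near agreement, a large one (67 vs 45) as weaker agreement.
    diffs = [abs(a - b) for a, b in zip(rater1, rater2)]
    print(diffs)                     # [1, 14, 22]
    print(sum(diffs) / len(diffs))   # mean absolute difference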

Best Answer

If the categories are considered predefined (i.e. known before the experiment), you could probably use Cohen's Kappa or another chance-corrected agreement coefficient (e.g. Gwet's AC, Krippendorff's Alpha) and apply appropriate weights to account for partial agreement; see Gwet (2014). However, it seems like an ICC could be appropriate, too. I do not really understand what the "situations" are all about.

Gwet, K. L. (2014). Handbook of Inter-Rater Reliability. Gaithersburg, MD: Advanced Analytics, LLC.
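For what it's worth, here is a minimal sketch of how an ICC for absolute agreement and Krippendorff's alpha with an interval metric could be computed in Python, assuming the pingouin and krippendorff packages and a long-format layout; the column names and the pooling of subject-situation combinations into "units" are illustrative assumptions, not something prescribed by the question.

    # A minimal sketch, not a definitive recipe: it assumes the ratings are in
    # long format with made-up column names ('unit', 'rater', 'score'), and it
    # pools subject-situation combinations within one experiment as the rated
    # units -- whether that pooling suits your design is up to you.
    import numpy as np
    import pandas as pd
    import pingouin as pg
    import krippendorff

    # Example values taken from the question (experiment 1, situations 1 and 2).
    rater1 = [13, 85, 67, 76, 62, 51]
    rater2 = [12, 71, 45, 90, 51, 51]
    units = [f"u{i}" for i in range(len(rater1))]

    long = pd.DataFrame({
        "unit":  units * 2,
        "rater": ["r1"] * len(rater1) + ["r2"] * len(rater2),
        "score": rater1 + rater2,
    })

    # ICC for absolute agreement: the 'ICC2' row (two-way random effects,
    # single rater) is penalised when one rater is systematically higher than
    # the other, unlike a plain Pearson correlation.
    icc = pg.intraclass_corr(data=long, targets="unit",
                             raters="rater", ratings="score")
    print(icc[icc["Type"] == "ICC2"])

    # Krippendorff's alpha with an interval metric: disagreement is weighted by
    # the squared distance between the two scores, so 76 vs 77 counts as near
    # agreement while 76 vs 20 counts as strong disagreement.
    ratings = np.array([rater1, rater2], dtype=float)
    print(krippendorff.alpha(reliability_data=ratings,
                             level_of_measurement="interval"))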
