Inter-Rater Reliability – Assessing Reliability for Multiple Categorical Variables and Two Raters

agreement-statistics

We want to know the inter-rater reliability for multiple variables. We are two raters. The variables are all categorical. This is just an example:

variable name   possible values
 sex             m, f
 jobtype         parttime, fulltime, other
 city            0, 1, 2, 3, 4, ..., 43 (there is a code number for each city)

The two raters extracted the data from "difficult" sources, so there is a possibility of errors and mistakes in that process.

First we thought about Cohen's kappa, but we are not sure how to use it with multiple variables, or whether it is the right solution for our needs. Maybe there is another statistical method that fits our needs better?

The point is that we don't want to report a kappa for each variable. We want one kappa (or something else) for the complete process (two raters, multiple variables).

Best Answer

Although your question asks for a single measure that summarizes agreement across all variables, I would recommend against this. Use a generalized measure of agreement that accommodates multiple categories and different weights (in case your categories are ordered) and apply it to each variable separately. I would recommend the generalized kappa, pi, or S coefficients (link to more information).

You could then average these scores across all variables to get a single number, but isn't it more useful to know how well the raters did for each variable? Suppose your raters had high agreement for one variable and low agreement for another. It would then appear that they did moderately well on average, but this isn't quite right: the data for one variable will be very reliable while the data for the other will be much less so. This is important information.
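To make the per-variable approach concrete, here is a minimal sketch in Python. It assumes the two raters' codings are stored in pandas DataFrames with one row per case and one column per variable, and it uses scikit-learn's cohen_kappa_score; the variable values shown are made-up illustration data, not from the original question.

    # A minimal sketch, assuming each rater's codings are in a pandas DataFrame
    # with one row per case and one column per variable.
    # The data below are invented purely for illustration.
    import pandas as pd
    from sklearn.metrics import cohen_kappa_score

    rater1 = pd.DataFrame({
        "sex":     ["m", "f", "f", "m", "m", "f"],
        "jobtype": ["parttime", "fulltime", "other", "fulltime", "parttime", "other"],
        "city":    [0, 12, 3, 43, 7, 3],
    })
    rater2 = pd.DataFrame({
        "sex":     ["m", "f", "m", "m", "m", "f"],
        "jobtype": ["parttime", "fulltime", "other", "parttime", "parttime", "other"],
        "city":    [0, 12, 3, 43, 7, 4],
    })

    # One kappa per variable: the per-variable report recommended above.
    # For ordered categories, weights="linear" or weights="quadratic" could be used.
    kappas = {col: cohen_kappa_score(rater1[col], rater2[col]) for col in rater1.columns}
    for col, k in kappas.items():
        print(f"{col}: kappa = {k:.2f}")

    # A single average is possible, but it hides which variables are unreliable.
    print(f"mean kappa = {sum(kappas.values()) / len(kappas):.2f}")

Reporting the per-variable kappas alongside any average keeps the information about which variables were coded reliably and which were not.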
