Solved – Why is Cohen’s kappa low despite high observed agreement

agreement-statistics, cohens-kappa, reliability

I have about 10 raters giving weighted ratings to different items of a dataset. The number of raters may differ from one item to another: an item may have any number of raters between 2 and 10. Raters can give each item one of 3 values (-1, 0, and 1).

I tried using weighted kappa to measure the agreement between each pair of raters at a time. However, the results I am getting seem a little strange: when only 2 of the 3 values are used by the raters, the result is 0 even if they agree on most items. Example:

(-1, -1; -1, -1; -1, -1; -1, -1; -1, 0; -1, 0)

Even though the two raters gave the same rating to 4 of the 6 items, the result is 0.

Is Cohen's kappa the best-suited method for my problem? If not, what's a better alternative?

Best Answer

In your example, Cohen's $\kappa$ coefficient is equal to $0$ despite observed agreement ($p_o$) being relatively high because chance agreement ($p_c$) is also high under Cohen's assumptions. Cohen's $\kappa$ estimates chance agreement from each rater's marginal proportions; since the first rater gave $-1$ to every item, $p_c$ reduces to the second rater's proportion of $-1$ ratings, $4/6 = .667$, which exactly matches $p_o$.

$$ \kappa = \frac{p_o - p_c}{1 - p_c} = \frac{.667 - .667}{1 - .667} = .000 $$
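For illustration, here is a minimal sketch in Python that reproduces these numbers from the six rating pairs in the question; the optional cross-check against scikit-learn's `cohen_kappa_score` is just one way to verify the hand computation.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Six items: rater A always gives -1, rater B agrees on 4 of 6
rater_a = np.array([-1, -1, -1, -1, -1, -1])
rater_b = np.array([-1, -1, -1, -1,  0,  0])

categories = [-1, 0, 1]

# Observed agreement: proportion of items with identical ratings
p_o = np.mean(rater_a == rater_b)                       # 0.667

# Chance agreement under Cohen's assumptions: product of the two
# raters' marginal proportions, summed over categories
p_c = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories)  # 0.667

kappa = (p_o - p_c) / (1 - p_c)                         # 0.0
print(round(p_o, 3), round(p_c, 3), round(kappa, 3))

# Cross-check with scikit-learn
print(cohen_kappa_score(rater_a, rater_b))              # 0.0
```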

You might try another chance-adjusted index that makes different assumptions than Cohen's $\kappa$ coefficient. One such option would be Bennett et al.'s $S$ score below, where $q$ is the number of possible categories. In this example, assuming the same three category options, $S$ would be higher.

$$ S = \frac{p_o - 1/q}{1 - 1/q} = \frac{.667 - .333}{1 - .333} = .500 $$
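Continuing the same sketch, $S$ needs only the observed agreement and the number of possible categories $q$, since it assumes uniform chance agreement of $1/q$:

```python
import numpy as np

rater_a = np.array([-1, -1, -1, -1, -1, -1])
rater_b = np.array([-1, -1, -1, -1,  0,  0])

p_o = np.mean(rater_a == rater_b)     # observed agreement, 0.667
q = 3                                 # possible categories: -1, 0, 1

# Bennett et al.'s S replaces Cohen's marginal-based p_c with 1/q
s_score = (p_o - 1 / q) / (1 - 1 / q)
print(round(s_score, 3))              # 0.5
```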

Both of these metrics (and several others) can be adapted to multiple raters, multiple categories, missing data, and weighting schemes for non-nominal categories. See my mReliability website.
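As a rough illustration of the multi-rater, missing-data case (a sketch of one common generalization, not the mReliability formulas themselves), observed agreement can be taken as the mean pairwise agreement among whichever raters rated each item, and then adjusted with the uniform $1/q$ chance term as above. The `multi_rater_s` helper and the example data below are hypothetical.

```python
from itertools import combinations
import numpy as np

def multi_rater_s(ratings, q):
    """Bennett-style S for a variable number of raters per item.

    `ratings` is a list of lists; each inner list holds the ratings the
    available raters gave to one item (missing raters are simply omitted).
    Observed agreement is the mean pairwise agreement within each item,
    averaged over items; chance agreement is the uniform 1/q.
    """
    item_agreements = []
    for item in ratings:
        pairs = list(combinations(item, 2))
        if not pairs:        # fewer than 2 raters: no agreement information
            continue
        item_agreements.append(np.mean([a == b for a, b in pairs]))
    p_o = np.mean(item_agreements)
    return (p_o - 1 / q) / (1 - 1 / q)

# Example: items rated by 2 to 4 raters, categories -1/0/1
data = [[-1, -1], [-1, -1, 0], [0, 0, 0, 1], [1, -1]]
print(round(multi_rater_s(data, q=3), 3))
```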