Solved – Fisher’s exact test vs kappa analysis

Tags: agreement-statistics, association-measure, cohens-kappa, contingency-tables, fishers-exact-test

I was reading a paper where the authors assessed the association between two different diagnostic tests intended to diagnose the same disease, and they performed the analysis with Fisher's exact test.

While I find this statistically appropriate, I began to wonder whether they could have used Cohen's kappa instead.

After a bit of reading I can find no recommendations on when to use either analysis. Both operate on contingency tables, and while I know Cohen's kappa is used to measure "agreement", I suspect it is mostly measuring the same thing as Fisher's test.

  • Am I correct in this?

  • Can anyone give me any guidelines or insight for when one test is more appropriate than the other?

Best Answer

I know I am answering this question two years later, but I hope future readers may find the answer helpful.

Cohen's $\kappa$ tests whether data fall on the diagonal of a classification table more often than chance would predict, whereas Fisher's exact test evaluates the association between two categorical variables.
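
For reference, Cohen's $\kappa$ is built from the observed agreement $p_o$ (the proportion of cases on the diagonal) and the agreement expected by chance from the marginals, $p_e$:

$\kappa = \dfrac{p_o - p_e}{1 - p_e}$,

so it is positive only when the diagonal is over-represented relative to what the marginals alone would produce, while Fisher's test reacts to any departure from independence, on or off the diagonal.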

In some cases, Cohen's $\kappa$ and Fisher's exact test may appear to agree. A simple case, however, will answer your question and show that the Fisher test is not appropriate for measuring rater agreement.

Imagine a $2 \times 2$ matrix like

$\begin{matrix} 10 & 20 \\ 20 & 10\end{matrix}$.

It is clear that there is an association between the two variables concentrated on the off-diagonal, but that the raters do not agree more than chance; in other words, they systematically disagree. From the matrix, we should expect the Fisher test to be significant while Cohen's $\kappa$ should not be. Carrying out the analysis confirms this expectation: Fisher's test gives $p = 0.01938$, while $\kappa = -0.333$ with $z = -4.743$ and $p = 0.999$.
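
If you want to reproduce the 2 x 2 example yourself, here is a minimal Python sketch: scipy's fisher_exact gives the exact test, and $\kappa$ is computed directly from the table. I omit the $z$ test for $\kappa$, since different packages use different variance estimators and may not match the numbers above exactly.

```python
import numpy as np
from scipy.stats import fisher_exact

# 2x2 table from the example: raters systematically disagree.
table = np.array([[10, 20],
                  [20, 10]])

# Fisher's exact test of association (two-sided).
_, p_fisher = fisher_exact(table)

# Cohen's kappa from the table: observed agreement p_o on the diagonal
# vs. chance agreement p_e implied by the marginals.
n = table.sum()
p_o = np.trace(table) / n
p_e = (table.sum(axis=1) * table.sum(axis=0)).sum() / n**2
kappa = (p_o - p_e) / (1 - p_e)

print(f"Fisher p = {p_fisher:.5f}")  # ~0.019: clear association
print(f"kappa    = {kappa:.3f}")     # -0.333: worse-than-chance agreement
```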

We can also construct another example where the two outcomes diverge, with the following matrix:

$\begin{matrix} 20 & 10 & 10 \\ 20 & 20 & 20 \\ 20 & 20 & 20 \end{matrix}$,

which gives $p = 0.4991$ for Fisher's test, while $\kappa = 0.0697$ with $z = 1.722$ and $p = 0.043$. So the raters agree slightly more than chance, but there is no association between the categorical variables.
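
The same computation works for the 3 x 3 case. One caveat for this sketch: the exact $p$ above comes from an $r \times c$ Fisher test (R's fisher.test handles that case), whereas scipy's fisher_exact is, at least in the versions I have used, limited to 2 x 2 tables, so I substitute a chi-squared test of independence purely to illustrate the contrast; it is also non-significant for this table.

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[20, 10, 10],
                  [20, 20, 20],
                  [20, 20, 20]])

# Chi-squared test of independence (approximation to the exact test here).
chi2, p_assoc, dof, _ = chi2_contingency(table)

# Cohen's kappa from the same table.
n = table.sum()
p_o = np.trace(table) / n
p_e = (table.sum(axis=1) * table.sum(axis=0)).sum() / n**2
kappa = (p_o - p_e) / (1 - p_e)

print(f"association p = {p_assoc:.3f}")  # ~0.47: no evidence of association
print(f"kappa         = {kappa:.3f}")    # ~0.070: slight above-chance agreement
```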

I don't have a more formal mathematical explanation of when the two should or should not converge, though.

Finally, given the current state of knowledge on Cohen's $\kappa$ in the methodological literature (see this for instance), you might want to avoid it as a measure of agreement. The coefficient has a lot of issues. Careful training of raters and strong agreement on each category (rather than overall agreement) are, I believe, the way to go.
