It sounds like you are asking a lot of different questions here.
My question is: how should I interpret the p-value? I don't understand what it refers to.
The null hypothesis for Fisher's exact test is that group and outcome are independent, i.e. that the groups do not affect the outcome. Rejecting the null hypothesis indicates that the outcome (a, b, or c) depends on the group.
dta <- matrix(c(2, 12, 1, 5, 3, 1),
              nrow = 2, ncol = 3, byrow = TRUE)
fisher.test(dta)
Fisher's Exact Test for Count Data
data: dta
p-value = 0.05082
alternative hypothesis: two.sided
In this case your $p$ value is approximately 0.05082. I will let you decide whether to reject the null.
Having the p-value, how can I say that one of the three forms is statistically significantly more represented than the others (if true)?
This is a separate question and I'm not sure what you are trying to ask.
You calculate a single $\kappa$ for all the categories at once. The formula for $\kappa$ is:
$$\kappa = \frac{p_{o}-p_{e}}{1-p_{e}} = \frac{N_{o}-N_{e}}{N-N_{e}}$$
This depends only on $p_{o}$ and $p_{e}$, respectively the observed agreement and the chance agreement, or equivalently on the corresponding counts $N_{o}$ and $N_{e}$ out of the $N$ total observations. The number of categories does not matter at all.
Say you have a table of observations $m_{i,j}$ with $k$ categories.
$\sum_i^k{m_{i,i}}$ is the sum of agreements in the table.
$\sum_j^k{m_{l,j}}$ is the sum of counts in the $l$-th row.
$\sum_i^k{m_{i,l}}$ is the sum of counts in the $l$-th column.
So, when you multiply the total count of a row by the total count of the corresponding column and divide by $N$, you get an estimate of the count of agreements expected by chance in that category. Summing this over all categories gives an overall estimate of chance agreement, which you can plug into the $\kappa$ formula.
$$N_o = \sum_i^k{m_{i,i}}$$
$$N_e = \frac{1}{N}\cdot\sum_l^k\left(\sum_j^k{m_{l,j}} \cdot \sum_i^k{m_{i,l}}\right)$$
$$\therefore \kappa = \frac{N_{o}-N_{e}}{N-N_{e}} = \frac{\sum_i^k{m_{i,i}}-\frac{1}{N}\cdot\sum_l^k\left(\sum_j^k{m_{l,j}} \cdot \sum_i^k{m_{i,l}}\right)}{N-\frac{1}{N}\cdot\sum_l^k\left(\sum_j^k{m_{l,j}} \cdot \sum_i^k{m_{i,l}}\right)}$$
And here's sample code in R:
kappa <- function(M) {
  Ne <- sum(rowSums(M) * colSums(M)) / sum(M)  # expected chance agreements, N_e
  (sum(diag(M)) - Ne) / (sum(M) - Ne)          # (N_o - N_e) / (N - N_e)
}
In case of perfect agreement:
> M = table(iris$Species, iris$Species)
> print(kappa(M))
[1] 1
In case of random predictions:
> M = table(sample(iris$Species, 150L), iris$Species)
> print(kappa(M))
[1] -0.09
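Since sample() is random, that last value will vary from run to run; set.seed() makes it reproducible. As a sanity check you can also compare against an existing implementation, for instance cohen.kappa() from the psych package (assuming you have it installed; it accepts a square count matrix):
set.seed(42)                 # make the random-predictions example reproducible
M <- table(sample(iris$Species, 150L), iris$Species)
kappa(M)                     # hand-rolled estimate from above
psych::cohen.kappa(M)$kappa  # unweighted kappa; should match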
Best Answer
I know I am answering the question two years later, but I hope future readers may find this helpful.
Cohen's $\kappa$ tests whether data fall on the diagonal of a classification table more often than chance would predict, whereas Fisher's exact test evaluates the association between two categorical variables.
In some cases, Cohen's $\kappa$ might appear to agree with Fisher's exact test, but a simple case will show why the Fisher test is not appropriate for rater agreement.
Imagine a $2 \times 2$ matrix like
$\begin{matrix} 10 & 20 \\ 20 & 10\end{matrix}$.
It is clear that there is an association between the two variables on the off-diagonal, but that the raters do not agree more than chance; in other terms, the raters systematically disagree. From the matrix, we should expect the Fisher test to be significant while Cohen's $\kappa$ should not be. Carrying out the analysis confirms this expectation: Fisher's test gives $p = 0.01938$, while $\kappa = -0.333$ with $z = -4.743$ and $p = 0.999$.
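As a quick check, this can be reproduced in R with fisher.test() and, for instance, the kappa() helper defined in the other answer above (a sketch; any Cohen's kappa implementation would do):
# 2x2 example: strong off-diagonal association, systematic disagreement
M <- matrix(c(10, 20,
              20, 10), nrow = 2, byrow = TRUE)
fisher.test(M)$p.value  # ~0.019: significant association
kappa(M)                # -1/3: agreement below chance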
We can also construct another example where the two tests diverge, with the following matrix:
$\begin{matrix} 20 & 10 & 10 \\ 20 & 20 & 20 \\ 20 & 20 & 20 \end{matrix}$,
which gives $p = 0.4991$ for Fisher's test, and $\kappa = 0.0697$ with $z = 1.722$ and $p = 0.043$. So the raters agree slightly more than chance, but there is no relation between the categorical variables.
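The same sketch applies here, again assuming the kappa() helper from above:
# 3x3 example: slight above-chance agreement, no significant association
M <- matrix(c(20, 10, 10,
              20, 20, 20,
              20, 20, 20), nrow = 3, byrow = TRUE)
fisher.test(M)$p.value  # ~0.50: no significant association
kappa(M)                # ~0.07: raters agree slightly more than chance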
I don't have a more formal mathematical explanation of when they should or should not converge, though.
Finally, given the actual state of knowledge on Cohen's $\kappa$ in the methodological literature (see this for instance), you might want to avoid it as a measure of agreement. The coefficient has a lot of issues. Careful training of raters and strong agreement on each category (rather than the overall agreement) is, I believe, the way to go.