Reliability – Can Cohen’s Kappa Be Used for Two Judgements Only?

information retrieval, reliability

I am using Cohen's Kappa to calculate the inter-rater agreement between two judges.

It is calculated as:

$ \kappa = \frac{P(A) - P(E)}{1 - P(E)} $

where $P(A)$ is the proportion of agreement and $P(E)$ the probability of agreement by chance.

Now for the following dataset, I get the expected results:

User A judgements: 
  - 1, true
  - 2, false
User B judgements: 
  - 1, false
  - 2, false
Proportion agreed: 0.5
Agreement by chance: 0.625
Kappa for User A and B: -0.3333333333333333
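
For reference, here is a minimal Python sketch of the calculation described above (the helper name `kappa_pooled` is just illustrative). Note that the chance term is estimated from the pooled category counts of both raters, which is what reproduces the 0.625 quoted above:

```python
from collections import Counter

def kappa_pooled(a, b):
    """Chance-corrected agreement for two equally long lists of judgements.
    The chance term uses the pooled category counts of both raters."""
    n = len(a)
    p_agree = sum(x == y for x, y in zip(a, b)) / n
    pooled = Counter(a) + Counter(b)
    p_chance = sum((count / (2 * n)) ** 2 for count in pooled.values())
    if p_chance == 1:
        # Only one category was ever chosen: (P(A) - P(E)) / (1 - P(E)) would be 0/0.
        return 0.0
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa_pooled([True, False], [False, False]))  # -0.3333333333333333
```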

We can see that the two judges did not agree very well. However, in the following case, where both judges evaluate only a single criterion, kappa evaluates to zero:

User A judgements: 
  - 1, false
User B judgements: 
  - 1, false
Proportion agreed: 1.0
Agreement by chance: 1.0
Kappa for User A and B: 0

Now I can see that the agreement by chance is obviously 1, which leads to kappa being zero, but does this count as a reliable result? The problem is that I normally don't have more than two judgements per criterion, so none of these will ever evaluate to a kappa greater than 0, which I think is not very representative.

Are my calculations correct? Can I use a different method to calculate inter-rater agreement?

Here we can see that kappa works fine for multiple judgements:

User A judgements: 
  - 1, false
  - 2, true
  - 3, false
  - 4, false
  - 5, true
User B judgements: 
  - 1, true
  - 2, true
  - 3, false
  - 4, true
  - 5, false
Proportion agreed: 0.4
Agreement by chance: 0.5
Kappa for User A and B: -0.19999999999999996
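
(For what it is worth, the sketch above gives the same figure here: `kappa_pooled([False, True, False, False, True], [True, True, False, True, False])` evaluates to -0.19999999999999996, while the single-judgement case `kappa_pooled([False], [False])` hits the degenerate branch and returns 0.)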

Best Answer

The "chance correction" in Cohen's $\kappa$ estimates probabilities with which each rater chooses the existing categories. The estimation comes from the marginal frequencies of the categories. When you only have 1 judgement for each rater, this means that $\kappa$ assumes the category chosen for this single judgement in general has a probability of 1. This obviously makes no sense since the number of judgements (1) is too small to reliably estimate the base rates of all categories.

An alternative might be a simple binomial model: without additional information, we might assume that the probability of agreement between the two raters on any single criterion is 0.5, since the judgements are binary. This implicitly assumes that both raters pick each category with probability 0.5 for every criterion. The number of agreements expected by chance across all criteria then follows a binomial distribution with $p = 0.5$.
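
As a rough sketch of this idea (the helper below is illustrative, not a standard routine; the figures come from the five-criteria example in the question):

```python
from math import comb

def p_at_least(k_agreements, n_criteria, p_agree=0.5):
    """Probability of at least k chance agreements out of n binary criteria,
    when each criterion is agreed on independently with probability p_agree
    (0.5 if both raters pick each of the two categories at random)."""
    return sum(
        comb(n_criteria, i) * p_agree**i * (1 - p_agree) ** (n_criteria - i)
        for i in range(k_agreements, n_criteria + 1)
    )

# Five criteria, two observed agreements (the last example in the question):
print(p_at_least(2, 5))  # 0.8125, so this much agreement is easily produced by chance
```

Under this model, an observed agreement count is compared against a binomial tail instead of being chance-corrected through marginal frequencies, which sidesteps the degenerate single-judgement case above.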
