Clustering – How to Calculate the Adjusted Rand Index

clustering

I'm really close to understanding the adjusted Rand index, but I lack a background in formal maths and I'm struggling to grasp one or two things.

I've been using the Wikipedia page primarily. I've calculated the Rand index for some pretend data.

The only part I'm struggling with is calculating nij, ai and bj. Am I adding every occurrence of them together? Or a single row? A single row wouldn't give the adjusted Rand index, would it?

Say I have two sets with two pairs in common between them: the total of bj is going to be two, the total of ai is also going to be two, and nij is also two. It doesn't seem right for them all to have the same value?

Here's an example contingency table.

     x1    x2    x3
y1   0     0     0
y2   0     1     0
y3   0     1     0

I get that I sum these up to make ai or bj, but I think I must be calculating it wrong?

Best Answer

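Basically:

  • nij is the count in cell (i, j) of the contingency table, i.e. the number of items that fall in row cluster i and column cluster j. The sum in the formula runs over every cell, not only the diagonal.
  • ai is the sum of row i (one value per row)
  • bj is the sum of column j (one value per column)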
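As a rough illustration of these three quantities, here is a minimal Python sketch that builds a contingency table from two label vectors. The names `labels_a` and `labels_b` and the ten labels themselves are assumptions made up for the example; rows index the first labelling and columns the second.

```python
from collections import Counter

# Hypothetical cluster labels for ten items under two different clusterings.
labels_a = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
labels_b = [1, 2, 1, 2, 2, 3, 3, 3, 3, 3]

# n_ij: number of items put in cluster i by the first clustering and in
# cluster j by the second. Every (i, j) cell is counted, not just i == j.
pair_counts = Counter(zip(labels_a, labels_b))
clusters_a = sorted(set(labels_a))
clusters_b = sorted(set(labels_b))
n = [[pair_counts[(i, j)] for j in clusters_b] for i in clusters_a]

a = [sum(row) for row in n]        # a_i: one sum per row
b = [sum(col) for col in zip(*n)]  # b_j: one sum per column

print(n)  # [[1, 1, 0], [1, 2, 1], [0, 0, 4]]
print(a)  # [2, 4, 4]
print(b)  # [2, 3, 5]
```

Note that the ai and the bj each add up to the total number of items (10 here), while the individual values generally differ from one another.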

From Wikipedia, we have the formula:

$ARI = \frac{ \sum_{ij} \binom{n_{ij}}{2} - [\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}] / \binom{n}{2} }{ \frac{1}{2} [\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}] - [\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}] / \binom{n}{2} }$

Let's assume we have the table:

       x1    x2    x3    Sums
y1     1     1     0      2
y2     1     2     1      4
y3     0     0     4      4
Sums   2     3     5

Breaking into components:

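  • $\sum_{ij} \binom{n_{ij}}{2} = \binom{2}{2} + \binom{4}{2} = 1 + 6 = 7$ (the sum is over all nine cells, but cells equal to 0 or 1 contribute nothing, since $\binom{0}{2} = \binom{1}{2} = 0$)
  • $\sum_i \binom{a_i}{2} = \binom{2}{2} + \binom{4}{2} + \binom{4}{2} = 13$
  • $\sum_j \binom{b_j}{2} = \binom{2}{2} + \binom{3}{2} + \binom{5}{2} = 14$
  • $\binom{n}{2} = \binom{10}{2} = 45$, where n = 10 is the total number of items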
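The component sums can be checked with a short Python sketch. It assumes Python 3.8 or later for `math.comb`, and the variable names are illustrative only; the table is the one from the example above.

```python
from math import comb

# Contingency table from the worked example (rows y1..y3, columns x1..x3).
table = [
    [1, 1, 0],
    [1, 2, 1],
    [0, 0, 4],
]

row_sums = [sum(row) for row in table]        # a_i
col_sums = [sum(col) for col in zip(*table)]  # b_j
n_total = sum(row_sums)                       # n

# comb(k, 2) is 0 for k < 2, so empty and singleton cells drop out.
sum_nij = sum(comb(cell, 2) for row in table for cell in row)
sum_ai = sum(comb(a, 2) for a in row_sums)
sum_bj = sum(comb(b, 2) for b in col_sums)
total_pairs = comb(n_total, 2)

print(sum_nij, sum_ai, sum_bj, total_pairs)  # 7 13 14 45
```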

So, then

$ARI = \frac{7 - 13 \cdot 14/45}{(13 + 14)/2 - 13 \cdot 14/45} \approx 0.313$
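For anyone checking the arithmetic by hand, the intermediate values work out roughly as follows. The term subtracted in both the numerator and the denominator is the expected pair count under random labelling:

$\frac{13 \cdot 14}{45} = \frac{182}{45} \approx 4.044$

$ARI \approx \frac{7 - 4.044}{13.5 - 4.044} = \frac{2.956}{9.456} \approx 0.313$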

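Confirming this result in R, using adjustedRandIndex() from the mclust package:

library(mclust)  # provides adjustedRandIndex()

# Cluster labels for the ten items; table(y, x) reproduces the table above
x <- c(1, 2, 1, 2, 2, 3, 3, 3, 3, 3)
y <- c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
adjustedRandIndex(x, y) # 0.313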
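If R is not to hand, the same check can be sketched in plain Python straight from the label vectors. The function name `adjusted_rand_index` is a hypothetical helper written for this example, not a library API, and it does not handle degenerate inputs (fewer than two items, or both labellings placing everything in one cluster), where the denominator becomes zero.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI of two labellings of the same items (illustrative helper)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labellings must cover the same items")

    # Counting labels gives the a_i and b_j; counting label pairs gives n_ij.
    sum_ai = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_bj = sum(comb(c, 2) for c in Counter(labels_b).values())
    sum_nij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())

    expected = sum_ai * sum_bj / comb(len(labels_a), 2)
    maximum = (sum_ai + sum_bj) / 2
    return (sum_nij - expected) / (maximum - expected)


labels_a = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
labels_b = [1, 2, 1, 2, 2, 3, 3, 3, 3, 3]
print(round(adjusted_rand_index(labels_a, labels_b), 3))  # 0.313
```

The index is symmetric, so swapping the two arguments gives the same value, and comparing a labelling with itself gives 1.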