Clustering – How to Calculate the Adjusted Rand Index

I'm really close to understanding the adjusted Rand index, but I lack a background in formal maths and I'm struggling to grasp one or two things.

I've been using the Wikipedia page primarily. I've calculated the Rand index for some pretend data.

The only part I'm struggling with is calculating n_ij, a_i and b_j. Am I adding every occurrence of them together, or just a single row? A single row wouldn't give the adjusted Rand index, would it?

Say I have two sets with two pairs in common between them: the total value of b_j is going to be two, the total value of a_i is also going to be two, and n_ij is also two. It doesn't seem right for them all to be the same value?

Here's an example contingency table.

     x1    x2    x3
y1   0     0     0
y2   0     1     0
y3   0     1     0

I get that I sum these up to make a_i or b_j, but I think I must be calculating them wrong?

Best Answer

Basically:

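n_ij is each cell of the contingency table (every combination of i and j, not only the diagonal where i = j)
a_i is the row sums
b_j is the column sums
n is the total number of objects (the sum of all cells)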
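
To make these quantities concrete, here is a minimal Python sketch that builds the contingency table from two label vectors and reads off n_ij, a_i and b_j. The helper name `contingency_table` is hypothetical, and the labels are the ten-object example used in the R check of this answer.

```python
from collections import Counter

def contingency_table(labels_x, labels_y):
    """Count how many objects fall in each (x cluster, y cluster) pair."""
    xs = sorted(set(labels_x))
    ys = sorted(set(labels_y))
    counts = Counter(zip(labels_x, labels_y))
    # Rows follow the first labelling, columns the second.
    return [[counts[(x, y)] for y in ys] for x in xs]

# Two clusterings of the same 10 objects.
labels_x = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
labels_y = [1, 2, 1, 2, 2, 3, 3, 3, 3, 3]

n_ij = contingency_table(labels_x, labels_y)  # one entry per (i, j) cell
a_i = [sum(row) for row in n_ij]              # row sums
b_j = [sum(col) for col in zip(*n_ij)]        # column sums

print(n_ij)  # [[1, 1, 0], [1, 2, 1], [0, 0, 4]]
print(a_i)   # [2, 4, 4]
print(b_j)   # [2, 3, 5]
```

Each object is counted in exactly one cell, so the row sums and the column sums both add up to the number of objects.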

Using Wikipedia, we have the formula:

$ARI = \frac{ \sum_{ij} \binom{n_{ij}}{2} - [\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}] / \binom{n}{2} }{ \frac{1}{2} [\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}] - [\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}] / \binom{n}{2} }$

Let's assume we have the table:

       x1    x2    x3    Sums
y1     1     1     0      2
y2     1     2     1      4
y3     0     0     4      4
Sums   2     3     5

Breaking into components:

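$\sum_{ij} \binom{n_{ij}}{2} = \binom{1}{2} + \binom{2}{2} + \binom{4}{2} = 7$ (the sum runs over every cell, but the cells not written out hold 0 or 1 and contribute nothing, since $\binom{0}{2} = \binom{1}{2} = 0$)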
$\sum_i \binom{a_i}{2} = \binom{2}{2} + \binom{4}{2} + \binom{4}{2} = 13$
$\sum_j \binom{b_j}{2} = \binom{2}{2} + \binom{3}{2} + \binom{5}{2} = 14$

So, with n = 10 objects and therefore $\binom{n}{2} = \binom{10}{2} = 45$:

$ARI = \frac{7 - 13 \cdot 14/45}{(13 + 14)/2 - 13 \cdot 14/45} \approx 0.313$
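
The same arithmetic can be sketched in plain Python. This is a minimal illustration of the formula above rather than a library implementation; the function name `adjusted_rand_index` is made up for this example, and `math.comb` needs Python 3.8 or later.

```python
from math import comb

def adjusted_rand_index(table):
    """ARI from a contingency table given as a list of rows."""
    a = [sum(row) for row in table]         # row sums a_i
    b = [sum(col) for col in zip(*table)]   # column sums b_j
    n = sum(a)                              # total number of objects

    # 'choose 2' over every cell, every row sum and every column sum.
    sum_nij = sum(comb(cell, 2) for row in table for cell in row)
    sum_a = sum(comb(x, 2) for x in a)
    sum_b = sum(comb(x, 2) for x in b)

    expected = sum_a * sum_b / comb(n, 2)   # chance-level term
    maximum = (sum_a + sum_b) / 2
    return (sum_nij - expected) / (maximum - expected)

table = [[1, 1, 0],
         [1, 2, 1],
         [0, 0, 4]]
ari = adjusted_rand_index(table)
print(round(ari, 3))  # 0.313
```

Note that this sketch does not guard against the degenerate case where the denominator is zero (for example, when both clusterings put everything in a single cluster).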

Confirming this result in R:

library(mclust)  # adjustedRandIndex() is provided by mclust

x <- c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
y <- c(1, 2, 1, 2, 2, 3, 3, 3, 3, 3)
adjustedRandIndex(x, y) # 0.313

Related Solutions

Classification – How to Intuitively Interpret Indices/Metrics for Comparing Partitions

The split/join metric measures the number of 'moves' required to go from the first clustering to the second, where each 'move' consists of splitting a single element off one cluster and then either attaching it to another cluster (which also counts as a move) or starting a new cluster. There is a further requirement, not very important for the intuition, that these moves are 'aligned' with the lattice of partitions: the path sketched out by the moves must also contain the largest common subclustering of the two clusterings. In your case, a single node or element is, I assume, a single cell. The intuition is thus that, for the left instance, eight cells need to be rearranged to obtain one of the clusterings from the other.

The question below and its answers may also be interesting: Comparing clusterings: Rand Index vs Variation of Information
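
If the metric meant here is van Dongen's split/join distance (an assumption, since the passage does not give a formula), it can be read off a contingency table as 2n minus the sum of the row maxima minus the sum of the column maxima. A minimal sketch, with a hypothetical helper name and an illustrative table:

```python
def split_join_distance(table):
    """Split/join distance from a contingency table (list of rows)."""
    n = sum(sum(row) for row in table)
    # Best overlap each cluster has with any cluster of the other clustering.
    row_best = sum(max(row) for row in table)
    col_best = sum(max(col) for col in zip(*table))
    # Elements outside the best overlap must be moved, in each direction.
    return (n - row_best) + (n - col_best)

# Illustrative table (not the one from the question): 10 objects.
table = [[1, 1, 0],
         [1, 2, 1],
         [0, 0, 4]]
print(split_join_distance(table))  # 6
```

Identical clusterings give a distance of 0, because every row and column maximum then covers its whole cluster.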

Solved – Rand index calculation

I was pondering the same thing, and I solved it like this. Suppose you have a co-occurrence matrix/contingency table where the rows are the ground-truth clusters and the columns are the clusters found by the clustering algorithm.

So, for the example in the book, it would look like:

  | 1 | 2 | 3
--+---+---+---
x | 5 | 1 | 2
--+---+---+---
o | 1 | 4 | 0
--+---+---+---
◊ | 0 | 1 | 3

Now, you can very easily compute TP + FP by taking the sum per column and applying 'choose 2' to each of those values. The column sums are [6, 6, 5], so you compute '6 choose 2' + '6 choose 2' + '5 choose 2' = 15 + 15 + 10 = 40.

Similarly, you can get TP + FN by taking the sums over the rows ([8, 5, 4] in the example above), applying 'choose 2' to each of those values, and summing the results: 28 + 10 + 6 = 44.

The TPs themselves can be calculated by applying 'choose 2' to every cell in the matrix and summing everything (with '0 choose 2' and '1 choose 2' both equal to 0): 10 + 6 + 3 + 1 = 20.

In fact, here is some Python code that does exactly that:

import numpy as np
from scipy.special import comb  # scipy.misc.comb was removed from SciPy

# comb does 'n choose k', but with exact=True it only accepts scalars,
# so we wrap it and vectorize it to apply it to arrays.
def myComb(a, b):
  return comb(int(a), int(b), exact=True)

vComb = np.vectorize(myComb)

def get_tp_fp_tn_fn(cooccurrence_matrix):
  # Pairs sharing a column (same found cluster): TP + FP
  tp_plus_fp = vComb(cooccurrence_matrix.sum(0, dtype=int), 2).sum()
  # Pairs sharing a row (same ground-truth cluster): TP + FN
  tp_plus_fn = vComb(cooccurrence_matrix.sum(1, dtype=int), 2).sum()
  # Pairs sharing a cell (same row and same column): TP
  tp = vComb(cooccurrence_matrix.astype(int), 2).sum()
  fp = tp_plus_fp - tp
  fn = tp_plus_fn - tp
  # All remaining pairs are true negatives
  tn = comb(int(cooccurrence_matrix.sum()), 2, exact=True) - tp - fp - fn

  return [tp, fp, tn, fn]

if __name__ == "__main__":
  # The co-occurrence matrix from the example in
  # Introduction to Information Retrieval (Manning, Raghavan & Schutze, 2009)
  # also available on:
  # http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
  cooccurrence_matrix = np.array([[5, 1, 2], [1, 4, 0], [0, 1, 3]])

  # Get the stats
  tp, fp, tn, fn = get_tp_fp_tn_fn(cooccurrence_matrix)

  print("TP: %d, FP: %d, TN: %d, FN: %d" % (tp, fp, tn, fn))

  # Print the measures:
  print("Rand index: %f" % (float(tp + tn) / (tp + fp + fn + tn)))

  precision = float(tp) / (tp + fp)
  recall = float(tp) / (tp + fn)

  print("Precision : %f" % precision)
  print("Recall    : %f" % recall)
  print("F1        : %f" % ((2.0 * precision * recall) / (precision + recall)))

If I run it I get:

$ python testCode.py
TP: 20, FP: 20, TN: 72, FN: 24
Rand index: 0.676471
Precision : 0.500000
Recall    : 0.454545
F1        : 0.476190

I actually didn't check any other examples than this one, so I hope I did it right.... ;-)
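
One way to check the counts independently is to brute-force them: expand the matrix into individual items and classify every pair directly. The sketch below does that with only the standard library; the function name `pair_counts` is made up for this example.

```python
from itertools import combinations

def pair_counts(matrix):
    """Brute-force TP, FP, TN, FN by checking every pair of items."""
    # Expand the table into one (class, cluster) label pair per item.
    items = [(r, c)
             for r, row in enumerate(matrix)
             for c, count in enumerate(row)
             for _ in range(count)]
    tp = fp = tn = fn = 0
    for (class1, cluster1), (class2, cluster2) in combinations(items, 2):
        same_class = class1 == class2
        same_cluster = cluster1 == cluster2
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

matrix = [[5, 1, 2], [1, 4, 0], [0, 1, 3]]
print(pair_counts(matrix))  # (20, 20, 72, 24)
```

This loops over all 136 pairs, so it is only practical for small tables, but it agrees with the 'choose 2' shortcut on this example.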
