I was wondering about the same thing, and I solved it like this. Suppose you have a co-occurrence matrix/contingency table where the rows are the ground truth clusters, and the columns are the clusters found by the clustering algorithm.
So, for the example in the book, it would look like:
  | 1 | 2 | 3
--+---+---+---
x | 5 | 1 | 2
o | 1 | 4 | 0
◊ | 0 | 1 | 3
Now you can easily compute TP + FP by taking the sum of each column and applying 'choose 2' to each of those sums. The column sums are [6, 6, 5], so TP + FP = '6 choose 2' + '6 choose 2' + '5 choose 2'.
Similarly, you get TP + FN by taking the sum of each row ([8, 5, 4] in the example above), applying 'choose 2' to each of those values, and summing the results.
The TPs themselves can be calculated by applying 'choose 2' to every cell in the matrix and summing everything (noting that '1 choose 2' is 0).
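For the table above, that works out to:

$$\mathrm{TP}+\mathrm{FP} = \binom{6}{2}+\binom{6}{2}+\binom{5}{2} = 40, \qquad \mathrm{TP}+\mathrm{FN} = \binom{8}{2}+\binom{5}{2}+\binom{4}{2} = 44,$$

$$\mathrm{TP} = \binom{5}{2}+\binom{2}{2}+\binom{4}{2}+\binom{3}{2} = 10+1+6+3 = 20,$$

so $\mathrm{FP} = 40 - 20 = 20$ and $\mathrm{FN} = 44 - 20 = 24$, which matches the program output below.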
In fact, here is some Python code that does exactly that:
import numpy as np
from scipy.special import comb  # 'n choose k'; note: scipy.misc.comb was removed in newer SciPy

# comb() can't be applied element-wise to an array with exact=True,
# so we vectorize it here...
def myComb(a, b):
    return comb(a, b, exact=True)

vComb = np.vectorize(myComb)

def get_tp_fp_tn_fn(cooccurrence_matrix):
    # TP + FP: 'choose 2' over the column sums
    tp_plus_fp = vComb(cooccurrence_matrix.sum(0, dtype=int), 2).sum()
    # TP + FN: 'choose 2' over the row sums
    tp_plus_fn = vComb(cooccurrence_matrix.sum(1, dtype=int), 2).sum()
    # TP: 'choose 2' over every individual cell
    tp = vComb(cooccurrence_matrix.astype(int), 2).sum()
    fp = tp_plus_fp - tp
    fn = tp_plus_fn - tp
    # TN: all remaining pairs
    tn = comb(cooccurrence_matrix.sum(), 2, exact=True) - tp - fp - fn
    return [tp, fp, tn, fn]

if __name__ == "__main__":
    # The co-occurrence matrix from the example in
    # Introduction to Information Retrieval (Manning, Raghavan & Schutze, 2009),
    # also available at:
    # http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
    cooccurrence_matrix = np.array([[5, 1, 2], [1, 4, 0], [0, 1, 3]])

    # Get the stats
    tp, fp, tn, fn = get_tp_fp_tn_fn(cooccurrence_matrix)
    print("TP: %d, FP: %d, TN: %d, FN: %d" % (tp, fp, tn, fn))

    # Print the measures:
    print("Rand index: %f" % (float(tp + tn) / (tp + fp + fn + tn)))

    precision = float(tp) / (tp + fp)
    recall = float(tp) / (tp + fn)

    print("Precision : %f" % precision)
    print("Recall : %f" % recall)
    print("F1 : %f" % ((2.0 * precision * recall) / (precision + recall)))
If I run it I get:
$ python testCode.py
TP: 20, FP: 20, TN: 72, FN: 24
Rand index: 0.676471
Precision : 0.500000
Recall : 0.454545
F1 : 0.476190
I didn't actually check any examples other than this one, so I hope I did it right... ;-)
Best Answer
The basic idea behind adjusting the Rand index for chance comes from the desire to answer the question: does my value of the Rand index (RI) indicate that the two clusterings are similar?
Say you compare two clusterings and get a value of $0.5$ (red line below). To determine whether a Rand index of $0.5$ is indeed a good score, it is assessed relative to the distribution of pairwise comparisons amongst a sample of 100 random clusterings drawn from the Permutation model (blue, upper), which has a mean Rand index of $0.44$ (black, upper). The Permutation model fixes the number and sizes of the clusters and randomly exchanges all of the elements between those clusters. Since our comparison of $0.5$ is bigger than the value expected by chance, $0.44$, the traditional Adjusted Rand Index says we performed well.
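As a side note, if you have the two flat label vectors themselves (rather than a contingency table), the traditional Adjusted Rand Index, which corrects by exactly this Permutation-model expectation, is available in scikit-learn. A minimal sketch with made-up labels:

    from sklearn.metrics import adjusted_rand_score

    # Hypothetical flat labelings of the same six points.
    labels_true = [0, 0, 0, 1, 1, 1]
    labels_pred = [0, 0, 1, 1, 2, 2]

    # ARI = (RI - E[RI]) / (max RI - E[RI]), with E[RI] taken under
    # the Permutation model (number and sizes of clusters held fixed).
    print(adjusted_rand_score(labels_true, labels_pred))  # ~0.24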
The problem is that there are many ways to define a random clustering, as discussed in Gates & Ahn (2017). So if your original clustering was derived using, say, K-means, which fixes the number of clusters but allows their sizes to vary, then the Permutation model is a poor choice of random clusterings to compare against.
The distribution of pairwise comparisons amongst a sample of 100 random clusterings drawn from the model with a Fixed Number of Clusters (blue, lower), which has a mean similarity of $0.59$ (black, lower), demonstrates that our value of $0.5$ is less similar than if we had drawn a random clustering!
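To make the difference between the two null models concrete, here is a rough Monte Carlo sketch (my own illustration, not from Gates & Ahn): it estimates the mean Rand index against a reference clustering under the Permutation model and under a model that fixes only the number of clusters. The choice of 17 points and 3 clusters is an arbitrary assumption.

    import numpy as np
    from scipy.special import comb

    def rand_index(a, b):
        """Plain Rand index of two flat labelings, via pair counting."""
        a, b = np.asarray(a), np.asarray(b)
        n = len(a)
        iu = np.triu_indices(n, k=1)             # all unordered pairs
        same_a = (a[:, None] == a[None, :])[iu]  # together in clustering a?
        same_b = (b[:, None] == b[None, :])[iu]  # together in clustering b?
        return (same_a == same_b).sum() / comb(n, 2, exact=True)

    rng = np.random.default_rng(0)
    n, k, trials = 17, 3, 100                    # arbitrary assumptions

    reference = rng.integers(0, k, size=n)       # clustering we evaluate against
    base = rng.integers(0, k, size=n)            # clustering whose sizes we freeze

    # Permutation model: shuffle labels, so cluster number AND sizes stay fixed.
    perm = [rand_index(reference, rng.permutation(base)) for _ in range(trials)]

    # Fixed-number-of-clusters model: k clusters, but sizes are free to vary.
    fixed_k = [rand_index(reference, rng.integers(0, k, size=n)) for _ in range(trials)]

    print("Mean RI, Permutation model:        %.2f" % np.mean(perm))
    print("Mean RI, fixed number of clusters: %.2f" % np.mean(fixed_k))

Comparing your observed Rand index to each of these estimated means tells you whether you beat chance under that particular notion of a random clustering.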