Solved – Rand index calculation

clustering

I'm trying to figure out how to calculate the Rand Index of a cluster algorithm, but I'm stuck at the point how to calculate the true and false negatives.

At the moment I'm using the example from the book An Introduction into Information Retrieval (Manning, Raghavan & Schütze, 2009). At page 359 they talk about how to calculate the Rand index. For this example they use three clusters and the clusters contains the following objects.

  1. a a a a a b
  2. a b b b b c
  3. a a c c c

I replace the object (orginal signs to letters, but the idea and count stay the same). I'll give the exact words from the book in order to see what they are talking about:

We first compute TP +FP. The three clusters contain 6, 6, and 5 points, respectively, so the total number of “positives” or pairs of documents that are in the same cluster is:

TP + FP = ${6 \choose 2}$ + ${6 \choose 2}$ + ${5 \choose 2}$ = 15 + 15+ 10 = 40

Of these, the a pairs in cluster 1, the b pairs in cluster 2, the c pairs in cluster 3,
and the a pair in cluster 3 are true positives:

TP = ${5 \choose 2}$ + ${4 \choose 2}$ + ${3 \choose 2}$ + ${2 \choose 2}$ = 10 + 6 + 3 + 1 = 20

Thus, FP = 40 − 20 = 20.

Till here there calculations are clear, and if I take other examples I get the same results, but when I want to calculate the false negative and true negative Manning et al. state the following:

FN and TN are computed similarly, resulting in the following contingency table:

The contingency table looks as follows:

+--------+--------+
| TP: 20 | FN: 24 |
+--------+--------+
| FP: 20 | TN: 72 |
+--------+--------+

The sentence: "FN and TN are computed similarly" is not clear to my and I don't understand which numbers I need to calculate the TN and FN. I can calculate the right side of the table by doing the following:

TP + FP + FN + TN = ${n \choose 2}$ = ${17 \choose 2}$ = 136

Source: http://en.wikipedia.org/wiki/Rand_index

Thus, FN + TN = 136 – TP + FP = 136 – 40 = 96, but this doesn't really help my in figuring out how to calculate the variables separately. Especially when the authors say: "FN and TN are computed similarly". I don't see how. Also when I look at other examples they calculate each cell of the contingency table by looking at each pair.

For example: http://www.otlet-institute.org/wikics/Clustering_Problems.html#toc-Subsection-4.1

My first question, based on the example of Manning et al (2009), is it possible to calculate the TN and FN if you only know the TP & NP? And if so, how does the similar calculation looks like based of the given example?

Best Answer

I was pondering about the same, and I solved it like this. Suppose you have a co-occurrence matrix/contingency table where the rows are the ground truth clusters, and the columns are the clusters found by the clustering algorithm.

So, for the example in the book, it would look like:

  | 1 | 2 | 3
--+---+---+---
x | 5 | 1 | 2
--+---+---+---
o | 1 | 4 | 0
--+---+---+---
◊ | 0 | 1 | 3

Now, you can very easily compute the TP + FP by taking the sum per column and 'choose 2' over all those values. So the sums are [6, 6, 5] and you do '6 choose 2' + '6 choose 2' + '5 choose 2'.

Now, indeed, similarly, you can get TP + FN by taking the sum over the rows (so, that is [8, 5, 4] in the example above), apply 'choose 2' over all those values, and take the sum of that.

The TP's themselves can be calculated by applying 'choose 2' to every cell in the matrix and taking the sum of everything (assuming that '1 choose 2' is 0).

In fact, here is some Python code that does exactly that:

import numpy as np
from scipy.misc import comb

# There is a comb function for Python which does 'n choose k'                                                                                            
# only you can't apply it to an array right away                                                                                                         
# So here we vectorize it...                                                                                                                             
def myComb(a,b):
  return comb(a,b,exact=True)

vComb = np.vectorize(myComb)

def get_tp_fp_tn_fn(cooccurrence_matrix):
  tp_plus_fp = vComb(cooccurrence_matrix.sum(0, dtype=int),2).sum()
  tp_plus_fn = vComb(cooccurrence_matrix.sum(1, dtype=int),2).sum()
  tp = vComb(cooccurrence_matrix.astype(int), 2).sum()
  fp = tp_plus_fp - tp
  fn = tp_plus_fn - tp
  tn = comb(cooccurrence_matrix.sum(), 2) - tp - fp - fn

  return [tp, fp, tn, fn]

if __name__ == "__main__":
  # The co-occurrence matrix from example from                                                                                                           
  # An Introduction into Information Retrieval (Manning, Raghavan & Schutze, 2009)                                                                       
  # also available on:                                                                                                                                   
  # http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html                                                                     
  #                                                                                                                                                      
  cooccurrence_matrix = np.array([[ 5,  1,  2], [ 1,  4,  0], [ 0,  1,  3]])

  # Get the stats                                                                                                                                        
  tp, fp, tn, fn = get_tp_fp_tn_fn(cooccurrence_matrix)

  print "TP: %d, FP: %d, TN: %d, FN: %d" % (tp, fp, tn, fn)

  # Print the measures:                                                                                                                                  
  print "Rand index: %f" % (float(tp + tn) / (tp + fp + fn + tn))

  precision = float(tp) / (tp + fp)
  recall = float(tp) / (tp + fn)

  print "Precision : %f" % precision
  print "Recall    : %f" % recall
  print "F1        : %f" % ((2.0 * precision * recall) / (precision + recall))

If I run it I get:

$ python testCode.py
TP: 20, FP: 20, TN: 72, FN: 24
Rand index: 0.676471
Precision : 0.500000
Recall    : 0.454545
F1        : 0.476190

I actually didn't check any other examples than this one, so I hope I did it right.... ;-)

Related Question