Solved – Is this big difference meaningful? 63% Rand index, but 0.004 Adjusted Rand Index

clustering, data mining, hierarchical clustering, r

I have a data set with n=175 objects and two different clusterings (A and B) with 5 and 6 clusters, respectively. The table comparing the two clusterings is below. First I calculated the Rand index, both manually in Excel and with the "cluster_similarity" function in R, and got 63.4%.
Then I calculated the adjusted Rand index, both in Excel and with the "adjustedRandIndex" function in R, and got 0.003, not even 3%. Why is there such a big difference? I am very confused: I was planning to use the Rand index in my paper, but I am afraid I may have to use the adjusted one. There are some zeros and ones in the table; maybe those are the problem.

n=175 for both clustering A and clustering B

Best Answer

Always use the adjusted Rand index. There is no reason to use the non-adjusted version.

Suppose you have a data set of 100 objects. In the first clustering, 90 are type A and 10 are type B. For the second clustering, pick 90 objects at random and label them A, and label the remaining 10 B. A typical confusion matrix will look like this:

81 9
9 1

and have a Rand index of about 0.70 (3474 of the 4950 object pairs agree), which looks fairly good. But the labels were assigned randomly, so the agreement cannot be meaningful! The adjusted Rand index of this solution is close to 0.
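
To check the example by hand, here is a minimal sketch in plain Python that computes both indices from a contingency table by counting pairs. The function names (`pair_counts`, `rand_index`, `adjusted_rand_index`) are made up for illustration and are not the R functions mentioned in the question. The table used is the expected one for a 90/10 split with independently assigned labels (cells 81, 9, 9, 1, summing to 100).

```python
from math import comb


def pair_counts(table):
    """Return pair counts C(k, 2) summed over cells, row totals, column totals, and all objects."""
    n = sum(sum(row) for row in table)
    cells = sum(comb(k, 2) for row in table for k in row)
    rows = sum(comb(sum(row), 2) for row in table)
    cols = sum(comb(sum(col), 2) for col in zip(*table))
    return cells, rows, cols, comb(n, 2)


def rand_index(table):
    cells, rows, cols, total = pair_counts(table)
    # Agreeing pairs = pairs together in both clusterings + pairs apart in both.
    return (total + 2 * cells - rows - cols) / total


def adjusted_rand_index(table):
    cells, rows, cols, total = pair_counts(table)
    expected = rows * cols / total   # expected "together in both" count under random labels
    max_index = (rows + cols) / 2
    # Note: undefined (division by zero) if both clusterings put everything in one cluster.
    return (cells - expected) / (max_index - expected)


# Expected table for 90/10 labels assigned independently (an assumption of this sketch).
table = [[81, 9], [9, 1]]
print(round(rand_index(table), 3))           # 0.702
print(round(adjusted_rand_index(table), 3))  # -0.002
```

The same functions can be applied to the 5 x 6 table from the question by passing it as a list of rows.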

Thus:

  1. A high Rand index may be due to the label distribution alone. With sufficiently unbalanced cluster sizes, even a value of 0.95 can still be random!
  2. Adjusted Rand values near 0 indicate results no better than random; values below 0 are even worse than guessing.
  3. Always prefer the adjusted Rand index to the regular Rand index!
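
These points can be illustrated with a small simulation. The sketch below is plain Python with hypothetical helper names (`rand_and_ari`, `simulate`); the split sizes, trial count, and seed are arbitrary choices. It shuffles the labels of a two-cluster data set many times and averages both indices over the random relabelings.

```python
import random
from collections import Counter
from math import comb


def rand_and_ari(labels_a, labels_b):
    """Return (Rand index, adjusted Rand index) for two labelings of the same objects."""
    total = comb(len(labels_a), 2)
    cells = sum(comb(k, 2) for k in Counter(zip(labels_a, labels_b)).values())
    rows = sum(comb(k, 2) for k in Counter(labels_a).values())
    cols = sum(comb(k, 2) for k in Counter(labels_b).values())
    rand = (total + 2 * cells - rows - cols) / total
    expected = rows * cols / total
    ari = (cells - expected) / ((rows + cols) / 2 - expected)
    return rand, ari


def simulate(n_major, n_minor, trials=1000, seed=0):
    """Average both indices over random relabelings that keep the cluster sizes fixed."""
    rng = random.Random(seed)
    first = ["A"] * n_major + ["B"] * n_minor
    rand_sum = ari_sum = 0.0
    for _ in range(trials):
        second = first[:]
        rng.shuffle(second)  # random labels with the same 'A'/'B' counts
        rand, ari = rand_and_ari(first, second)
        rand_sum += rand
        ari_sum += ari
    return rand_sum / trials, ari_sum / trials


for split in [(90, 10), (99, 1)]:
    mean_rand, mean_ari = simulate(*split)
    print(split, round(mean_rand, 3), round(mean_ari, 3))
```

Under these assumptions, the mean Rand index grows as the split becomes more unbalanced, while the mean adjusted Rand index should stay near 0 in both cases.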

In the example from your question, the two clusterings are only about as similar as random labelings would be.