Solved – Clustering – Use ARI to compare different clustering

clustering, k-means

I have a data set of 54000 genes, which I clustered with several methods: HAC, K-means, model-based clustering, and CLARA. The objective is to compare these methods, and I used the Adjusted Rand Index (ARI) to do so. But there is something that I do not understand.

With my data set, the ARI between two clustering results that were both obtained by K-means with the same number of clusters (i.e., I ran K-means twice) is only 0.40, which is not a high value.

My question is: if the ARI is not high when a method is compared with itself, can we use ARI to compare the clustering results of different methods? And are there other indices or methods for comparing them? I have already read the topic "How to select a clustering method? How to validate a cluster solution (to warrant the method choice)?", but I still do not understand which methods are used to compare clustering results.

Best Answer

  1. It is a fallacy to reason that, because the ARI is not high when a method is compared with itself, ARI cannot be used to compare the clustering results of different methods. The results of most clustering methods, K-means included, depend heavily on the input "tuning" parameters (for K-means, the initial center seeds) and on data preprocessing. Your two runs of K-means, whose results you are comparing, presumably differed in some such respect (which one, by the way? You haven't said). Why do you expect the results to be very similar? They need not be, especially if there is hardly any cluster structure in the data or if the number of clusters was wrong. There is no reason to think, a priori and in general, that the difference between two runs of the same method under different parameters ought to be smaller than the difference between two different methods.

  2. ARI's baseline (value 0) is not the absence of matching (similarity in results) but the level of chance matching. So a value of $0.40$ is not low; I would call it a medium-sized value. But did you check the unadjusted Rand value? It will be higher.

  3. There are many "external clustering criteria" besides Rand and Adjusted Rand. See some of their formulae in the description of my !cluagree SPSS macro on my web page (download the collection named "Compare partitions" there).
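To make point 2 concrete, here is a minimal pure-Python sketch of the unadjusted Rand index and the Hubert-Arabie adjusted Rand index, both computed from pair counts in the contingency table of two partitions. The function names (`pair_counts`, `rand_index`, `adjusted_rand_index`) and the simulated label vectors are illustrative assumptions, not taken from the question or from the !cluagree macro. It assumes at least two objects per labeling.

```python
import random
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Pair counts from the contingency table of two partitions (hypothetical helper)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    cells = Counter(zip(labels_a, labels_b))  # contingency table cells n_ij
    rows = Counter(labels_a)                  # cluster sizes in partition A
    cols = Counter(labels_b)                  # cluster sizes in partition B
    sum_cells = sum(comb(c, 2) for c in cells.values())  # pairs together in both
    sum_rows = sum(comb(c, 2) for c in rows.values())    # pairs together in A
    sum_cols = sum(comb(c, 2) for c in cols.values())    # pairs together in B
    return sum_cells, sum_rows, sum_cols, comb(len(labels_a), 2)


def rand_index(labels_a, labels_b):
    """Unadjusted Rand index: share of object pairs on which the partitions agree."""
    sum_cells, sum_rows, sum_cols, total = pair_counts(labels_a, labels_b)
    # agreements = pairs together in both + pairs apart in both
    agreements = total + 2 * sum_cells - sum_rows - sum_cols
    return agreements / total


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index: 0 is the chance level, 1 means identical partitions."""
    sum_cells, sum_rows, sum_cols, total = pair_counts(labels_a, labels_b)
    expected = sum_rows * sum_cols / total   # expected pair agreement under chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate: both partitions trivial and identical
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Simulated labelings (assumed data, for illustration only).
rng = random.Random(0)
n, k = 2000, 5
a = [rng.randrange(k) for _ in range(n)]
b = [rng.randrange(k) for _ in range(n)]  # independent of a: pure chance agreement
c = [x if rng.random() < 0.7 else rng.randrange(k) for x in a]  # partly copies a

print("independent:", rand_index(a, b), adjusted_rand_index(a, b))
print("partly equal:", rand_index(a, c), adjusted_rand_index(a, c))
```

For the two independent labelings the unadjusted Rand value should land well above zero while the ARI stays close to zero, which is the chance baseline described in point 2. The same two functions can be applied to the label vectors from two K-means runs started from different seeds (point 1). If scikit-learn is available, `sklearn.metrics.adjusted_rand_score` and `sklearn.metrics.rand_score` should give the same quantities.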
