Methods for comparing clustering results

clustering, r, unsupervised learning, weka

I am doing an unsupervised clustering analysis for a genomics project. Because there is no ground truth, I have no direct way to tell whether a particular clustering is good or not.

I am running different clustering algorithms on different 'sets of features'. By different 'sets of features' I mean that, given a data frame, I choose different combinations of columns depending on their biological importance. For instance, some variables measure things at the sequence level, while others measure a particular cellular process or some other feature that cannot be measured at the sequence level. I am experimenting with the outputs for these sets of features: running the algorithms with all the features, leaving some out, etc.

What I want is to compare the different clusters of these different runs and see if some of my objects are being clustered similarly despite lacking some sets of features. Does this make sense?

Are there any recommendations on how I can do this?

Best Answer

You can use the Adjusted Rand Index or the Adjusted Mutual Information to measure the similarity (agreement) of the overall results of two clustering algorithms on an overlapping dataset.

Both scores are adjusted for chance, which means that two random clusterings will likely have an ARI or AMI close to 0.0, while two identical clusterings score 1.0.
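As a rough illustration, here is a minimal sketch in Python using scikit-learn (my choice of tooling, not something the question uses). The data is synthetic, and the column split into "all features" versus "first three features" is purely hypothetical; it stands in for your biologically motivated feature sets.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# Synthetic stand-in for the real data frame: 300 objects, 6 features.
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)

# Run 1 uses every feature; run 2 uses only a subset of the columns
# (hypothetical split, e.g. "sequence-level features only").
labels_all = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_subset = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, :3])

# Agreement between the two runs on the same objects.
# Both scores ignore how the cluster labels happen to be numbered.
ari = adjusted_rand_score(labels_all, labels_subset)
ami = adjusted_mutual_info_score(labels_all, labels_subset)
print(f"ARI = {ari:.3f}, AMI = {ami:.3f}")

# Chance baseline: two unrelated random labelings should score near 0.
rng = np.random.default_rng(0)
random_a = rng.integers(0, 3, size=300)
random_b = rng.integers(0, 3, size=300)
ari_random = adjusted_rand_score(random_a, random_b)
print(f"ARI of two random labelings = {ari_random:.3f}")
```

The only requirement is that both label vectors refer to the same objects in the same order; the number of clusters in the two runs does not have to match.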

Furthermore, you can use those measures for model selection (e.g. choosing k in k-means) by running the clustering algorithm on two overlapping samples of the dataset and measuring the agreement on the overlap. The assumption is that high agreement on the overlap indicates that the algorithm is more stable, and hence that this value of k is better (it better captures the real structure of the dataset).
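The resampling idea can be sketched as follows, again in Python with scikit-learn and synthetic data. The helper name `overlap_stability`, the 80% subsample fraction, and the number of rounds are all assumptions made for illustration; this is not the exact procedure from the paper cited below.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def overlap_stability(X, k, n_rounds=10, frac=0.8, seed=0):
    """Mean ARI on the overlap of two random subsamples, clustered separately."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    size = int(frac * n)
    scores = []
    for _ in range(n_rounds):
        # Two subsamples drawn without replacement; they overlap by construction.
        idx_a = rng.choice(n, size=size, replace=False)
        idx_b = rng.choice(n, size=size, replace=False)
        labels_a = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx_a])
        labels_b = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx_b])
        # Positions of the shared objects inside each subsample.
        _, pos_a, pos_b = np.intersect1d(idx_a, idx_b, return_indices=True)
        scores.append(adjusted_rand_score(labels_a[pos_a], labels_b[pos_b]))
    return float(np.mean(scores))


# Synthetic data with four well-separated groups.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

scores = {k: overlap_stability(X, k) for k in range(2, 7)}
for k, score in scores.items():
    print(f"k = {k}: mean ARI on overlap = {score:.3f}")
```

A value of k whose score stays close to 1.0 across rounds is a candidate for the "real" number of clusters. Small values of k can also be stable on some datasets, so it is worth looking at the whole curve rather than only its maximum.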

"A Novel Approach for Automatic Number of Clusters Detection in Microarray Data based on Consensus Clustering" by Nguyen and Epps is probably the best reference for this method, and it applies the method to microarray data.
