You can use the Adjusted Rand Index (ARI) or the Adjusted Mutual Information (AMI) to measure the similarity (agreement) of the overall results of two clustering algorithms on the same or overlapping data.
Both scores are adjusted for chance, which means that two random clusterings will likely have an ARI or AMI close to 0.0.
Furthermore, you can use these measures for model selection (e.g. finding the number of clusters k in k-means) by running the clustering algorithm twice on two overlapping samples of the dataset and measuring the agreement on the overlap. The assumption is that high agreement on the overlap indicates higher stability of the algorithm and hence a better value of k (the clustering better captures the real structure of the dataset).
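As an illustration, here is a minimal pure-Python sketch of the Adjusted Rand Index (the function name and toy labelings are made up for this example; in practice you would use an existing implementation such as sklearn.metrics.adjusted_rand_score in Python, or mclust's adjustedRandIndex in R):

```python
# Minimal ARI sketch, not a production implementation.
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    # Contingency table between the two labelings.
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # Adjustment for chance: subtract the expected index of a random labeling.
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions (up to relabeling) score 1.0.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

To apply the model-selection idea above, you would restrict the two label vectors to the objects present in both samples before computing the score.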
"A Novel Approach for Automatic Number of Clusters Detection in Microarray Data based on Consensus Clustering" by Nguyen and Epps is probably the best reference for this method; there the approach is applied to microarray data.
You are using the kmeans function, which will not give exactly the same results every time you run it.
The k-means algorithm works by using randomly chosen centroids as a starting point. These are generated using R's pseudorandom number generator (PRNG).
The PRNG generates a series of random values which depend on a seed.
From ?set.seed:
Initially, there is no seed; a new one is created from the current time (and since R 2.14.0, the process ID) when one is required. Hence different sessions will give different simulation results, by default. However, the seed might be restored from a previous session if a previously saved workspace is restored.
If you want to always obtain the same results you should set a seed at the start of your script.
For instance:
set.seed(12345)
Different seeds will give different results, but once you have fixed the seed the results will always be the same.
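The same principle applies in any language with a seedable PRNG. Here is a small Python analogue (the helper function is hypothetical, written only for illustration) showing that a fixed seed makes a random choice of starting points reproducible:

```python
import random

def random_centroids(data, k, seed):
    # Seeded generator: the analogue of calling set.seed() in R.
    rng = random.Random(seed)
    return rng.sample(data, k)

data = list(range(100))
first = random_centroids(data, k=3, seed=12345)
second = random_centroids(data, k=3, seed=12345)
# Same seed, same PRNG stream, same starting centroids: first == second.
```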
Now, the fact that:
"They are not different in everything, but there are individuals that now belong to another cluster!"
is a good thing: it means that you can cluster most individuals with good confidence. The ones that change between clusters are probably a bit "borderline".
One thing that you should do, however, is to set the nstart parameter in kmeans. Setting nstart to 10, for instance, will make the algorithm run 10 times with 10 different starting sets of points and return the best fit (the one with the minimum within-cluster sum of squares).
This will help in reducing "bad clustering" due to an "unlucky" choice of starting points.
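What nstart does can be sketched as follows. This is a toy one-dimensional Lloyd's algorithm in Python, not R's kmeans, written only to show the "run several times, keep the lowest within-cluster sum of squares" logic:

```python
import random

def lloyd_1d(data, k, rng, iters=50):
    """One k-means run from randomly sampled starting centroids."""
    centroids = rng.sample(data, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            clusters[nearest].append(x)
        # Move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    wss = sum(min((x - c) ** 2 for c in centroids) for x in data)
    return centroids, wss

def kmeans_nstart(data, k, nstart=10, seed=1):
    """Run Lloyd's algorithm nstart times, keep the fit with minimal WSS."""
    rng = random.Random(seed)
    return min((lloyd_1d(data, k, rng) for _ in range(nstart)),
               key=lambda result: result[1])

centroids, wss = kmeans_nstart([0.0, 0.1, 0.2, 10.0, 10.1, 10.2], k=2)
```

With several restarts, an unlucky pair of starting centroids on one run is simply outvoted by the better runs.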
Finally, I am not completely sure what the point of running hclust on the kmeans results is. Either run hclust directly on the original data, or just show the kmeans results.
To compare the similarity of two hierarchical (tree-like) structures, measures based on the idea of cophenetic correlation are used. But is it correct to compare dendrograms in order to select the "right" method or distance measure in hierarchical clustering?
There are some points - hidden snags - regarding hierarchical cluster analysis that I would consider quite important.
If after the above precautions you still think that you want a measure of similarity between hierarchical classifications, you might google 'comparing dendrograms' and 'comparing hierarchical classifications'. One idea that readily suggests itself is based on the cophenetic correlation: having two dendrograms for the same dataset of n objects, let $X_{ij}$ be the colligation coefficient (or perhaps its rank, the step number) at which the pair of objects $ij$ is first joined in one dendrogram, and let $Y_{ij}$ likewise be the same in the other dendrogram. Compute the correlation or cosine similarity between the $X_{ij}$ and $Y_{ij}$ values over all pairs.
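A hedged sketch of that computation in Python (the dict-of-pairs input format is just one possible encoding of the cophenetic values; in R you could instead combine cophenetic() and cor()):

```python
from math import sqrt

def cophenetic_correlation(x, y):
    """Pearson correlation between cophenetic values X_ij and Y_ij of two
    dendrograms over the same objects; x and y map each pair (i, j) to the
    level at which the pair is first merged in each dendrogram."""
    pairs = sorted(x)  # same object pairs must appear in both trees
    xs = [x[p] for p in pairs]
    ys = [y[p] for p in pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / sqrt(vx * vy)
```

Two dendrograms that merge pairs in the same relative order (even at rescaled heights) score 1.0; disagreeing merge structures pull the score down.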
$^1$ Later update on the problem of the dendrogram of Ward's method. Different clustering programs may output differently transformed agglomeration coefficients for Ward's method. Hence their dendrograms will look somewhat different even though the clustering history and results are the same. For example, SPSS does not take the root of the ultrametric coefficients, and it cumulates them in the output. Another tradition (found in some R packages, for example) is to take the root (the so-called "Ward-2" implementations) and not to cumulate. To repeat: such differences affect only the general shape/looks of the dendrogram, not the clustering results. But the looks of the dendrogram might influence your decision about the number of clusters. The moral is that it would be safe not to rely on the dendrogram in Ward's method at all, unless you know exactly what these coefficients output by your program are and how to interpret them correctly.
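A small numeric illustration with made-up coefficients (the values are hypothetical, chosen only to show that the two reporting conventions are monotone transforms of the same merge history):

```python
from itertools import accumulate
from math import sqrt

raw = [4.0, 9.0, 25.0]                # hypothetical Ward ultrametric coefficients
spss_style = list(accumulate(raw))    # cumulated, root not taken -> [4.0, 13.0, 38.0]
ward2_style = [sqrt(h) for h in raw]  # root taken, not cumulated -> [2.0, 3.0, 5.0]

# Both sequences are increasing transforms of the same merge history, so the
# merge order (and hence any flat clustering cut from the tree) is identical;
# only the displayed heights, i.e. the dendrogram's looks, differ.
```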