Solved – The optimal number of cluster by Gap Statistics

clusteringr

I'm using the GAP statistics (clusGAP) to find the optimal number of clusters in my gene expression data. But I'm not sure whether the optimal number suggested by clusGAP is right or not.
I ran the clusGAP for several times (clustGAP(data, kmeans, K.max = 30, B = 100)),
but I received different results as follow:
Suggested number of cluster is 11

The suggested number of cluster is 11 for above figure ("firstSEmax");

For another trial,
Suggested number of cluster is 11

The suggested number of cluster is 7 for above figure ("firstSEmax");

So I have two questions here:

  1. I thought in these results, it is hard to find the maximum of Gap value compared to the examples in the original paper. I'm new in this field, so I don't know whether I can believe the optimal number of clusters suggested by the clustGAP?

  2. As I mentioned, the optimal number is different in different trials, then the question is which one to choose? or how can I get a consistent result?

Best Answer

I do not trust the Gap statistic, or any of these heuristics. In particularly not if the plots are as smooth as these.

Please try to visualize your data. Make sure you have preprocessed it well enough. In all the similar questions here (try searching for "gap statistic") I had the impression that either the data was not well preprocessed, or just doesn't contain clusters at all...

This answer: https://stats.stackexchange.com/a/140723/7828 is a great example on why I (A) do not trust the gap statistic, (B) preprocessing is really really really important, (C) visualization is helpful.

Really visualizing your data is worth all the effort. In particular, also visualize the "best" result. Does it look good to you, or anomalous? Instead of trusting on some debated statistic like the Gap statistic, it is much better to trust a good visualization.