Solved – How should I interpret GAP statistic

clustering

I used GAP statistic to estimate k clusters in R. However I'm not sure if I interpret it well. enter image description here

From the plot above I assume that I should use 3 clusters.

enter image description here

From the second plot I should choose 6 clusters. Is it correct interpretation of GAP statistic?

I would be grateful for any explanation.

Best Answer

To obtain an ideal clustering, you should select $k$ such that you maximize the gap statistic. Here's the exemple given by Tibshirani et al. (2001) in their paper, the plot formed by artificial data with 2 clusters. As you can see, 2 is clearly the ideal $k$, because the gap statistic is maximized at $k=2$:

Gap statistic

However, in many real-world datasets, the clusters are not as well-defined, and we want to be able to balance maximizing the gap statistic with parsimony of the model. Case in point: OP's first image. If we're maximizing the gap statistic alone, then we should choose the model with 30 (or even more!) clusters. Assuming that that plot is just going to continue to increase, of course, the results are less useful. So Tibshirani suggests the 1-standard-error method:

Choose the cluster size $\hat{k}$ to be the smallest $k$ such that $\text{Gap}(k) \geq \text{Gap}(k + 1) - s_{k + 1}$.

Which informally is identifying the point at which the rate of increase of the gap statistic begins to "slow down".

So, in OP's first image, if we take the red error bars to be standard error, then 3 is the smallest $k$ that satisfies this criterion:

Annotated image 1

For OP's second image, however, you'll see that the gap statistic immediately decreases for $k > 1$. So the first $k$ that satisfies the 1-standard-error criterion is $1$. This is the plot's way of saying that the data should not be clustered.

As it turns out, there are additional ways to choose optimal $k$. The default method of the R function clusGap, for example, always searches for the local maximum of the graph, and selects the smallest $k$ within one standard error of the local max. Using this firstSEmax method, we would select $k = 30$ and $k = 19$ for OP's graphs 1 and 2 respectively. As I said, however, this seems to suffer from a complexity issue.

Source: Robert Tibshirani, Guenther Walther, and Trevor Hastie (2001). Estimating the number of clusters in a data set via the gap statistic.

Related Question