R – Understanding Gap Statistic for K-Means Clustering

clusteringk-meansmachine learningr

I am using K-means to cluster my data and was looking for a way to suggest an "optimal" cluster number. Gap statistics seems to be a common way to find a good cluster number.

For some reason it returns 1 as optimal cluster number, but when I look at the data it's obvious that there are 2 clusters:

![1](http://i60.tinypic.com/28bdy6u.jpg)

This is how I call gap in R:

gap <- clusGap(data, FUN=kmeans, K.max=10, B=500)
with(gap, maxSE(Tab[,"gap"], Tab[,"SE.sim"], method="firstSEmax"))

The result set:

> Number of clusters (method 'firstSEmax', SE.factor=1): 1
          logW   E.logW           gap    SE.sim
[1,]  5.185578 5.085414 -0.1001632148 0.1102734
[2,]  4.438812 4.342562 -0.0962498606 0.1141643
[3,]  3.924028 3.884438 -0.0395891064 0.1231152
[4,]  3.564816 3.563931 -0.0008853886 0.1387907
[5,]  3.356504 3.327964 -0.0285393917 0.1486991
[6,]  3.245393 3.119016 -0.1263766015 0.1544081
[7,]  3.015978 2.914607 -0.1013708665 0.1815997
[8,]  2.812211 2.734495 -0.0777154881 0.1741944
[9,]  2.672545 2.561590 -0.1109558011 0.1775476
[10,] 2.656857 2.403220 -0.2536369287 0.1945162

Am I doing something wrong or does someone know a better way to get a good cluster number?

Best Answer

Clustering depends on scale, among other things. For discussions of this issue see (inter alia) When should you center and standardize data? and PCA on covariance or correlation?.

Here are your data drawn with a 1:1 aspect ratio, revealing how much the scales of the two variables differ:

Figure 1

To its right, the plot of the gap stats shows the statistics by number of clusters ($k$) with standard errors drawn with vertical segments and the optimal value of $k$ marked with a vertical dashed blue line. According to the clusGap help,

The default method "firstSEmax" looks for the smallest $k$ such that its value $f(k)$ is not more than 1 standard error away from the first local maximum.

Other methods behave similarly. This criterion does not cause any of the gap statistics to stand out, resulting in an estimate of $k=1$.

Choice of scale depends on the application, but a reasonable default starting point is a measure of dispersion of the data, such as the MAD or standard deviation. This plot repeats the analysis after recentering to zero and rescaling to make a unit standard deviation for each component $a$ and $b$:

Figure 2

The $k=2$ K-means solution is indicated by varying symbol type and color in the scatterplot of the data at left. Among the set $k\in\{1,2,3,4,5\}$, $k=2$ is clearly favored in the gap statistics plot at right: it is the first local maximum and the stats for smaller $k$ (that is, $k=1$) are significantly lower. Larger values of $k$ are likely overfit for such a small dataset, and none are significantly better than $k=2$. They are shown here only to illustrate the general method.


Here is R code to produce these figures. The data approximately match those shown in the question.

library(cluster)
xy <- matrix(c(29,391, 31,402, 31,380, 32.5,391, 32.5,360, 33,382, 33,371,
        34,405, 34,400, 34.5,404, 36,343, 36,320, 36,303, 37,344,
        38,358, 38,356, 38,351, 39,318, 40,322, 40, 341), ncol=2, byrow=TRUE)
colnames(xy) <- c("a", "b")
title <- "Raw data"
par(mfrow=c(1,2))
for (i in 1:2) {
  #
  # Estimate optimal cluster count and perform K-means with it.
  #
  gap <- clusGap(xy, kmeans, K.max=10, B=500)
  k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method="Tibs2001SEmax")
  fit <- kmeans(xy, k)
  #
  # Plot the results.
  #
  pch <- ifelse(fit$cluster==1,24,16); col <- ifelse(fit$cluster==1,"Red", "Black")
  plot(xy, asp=1, main=title, pch=pch, col=col)
  plot(gap, main=paste("Gap stats,", title))
  abline(v=k, lty=3, lwd=2, col="Blue")
  #
  # Prepare for the next step.
  #
  xy <- apply(xy, 2, scale)
  title <- "Standardized data"
}