I am using K-means to cluster my data and was looking for a way to suggest an "optimal" cluster number. Gap statistics seems to be a common way to find a good cluster number.
For some reason it returns 1 as optimal cluster number, but when I look at the data it's obvious that there are 2 clusters:
This is how I call gap in R:
gap <- clusGap(data, FUN=kmeans, K.max=10, B=500)
with(gap, maxSE(Tab[,"gap"], Tab[,"SE.sim"], method="firstSEmax"))
The result set:
> Number of clusters (method 'firstSEmax', SE.factor=1): 1
logW E.logW gap SE.sim
[1,] 5.185578 5.085414 -0.1001632148 0.1102734
[2,] 4.438812 4.342562 -0.0962498606 0.1141643
[3,] 3.924028 3.884438 -0.0395891064 0.1231152
[4,] 3.564816 3.563931 -0.0008853886 0.1387907
[5,] 3.356504 3.327964 -0.0285393917 0.1486991
[6,] 3.245393 3.119016 -0.1263766015 0.1544081
[7,] 3.015978 2.914607 -0.1013708665 0.1815997
[8,] 2.812211 2.734495 -0.0777154881 0.1741944
[9,] 2.672545 2.561590 -0.1109558011 0.1775476
[10,] 2.656857 2.403220 -0.2536369287 0.1945162
Am I doing something wrong or does someone know a better way to get a good cluster number?
Best Answer
Clustering depends on scale, among other things. For discussions of this issue see (inter alia) When should you center and standardize data? and PCA on covariance or correlation?.
Here are your data drawn with a 1:1 aspect ratio, revealing how much the scales of the two variables differ:
To its right, the plot of the gap stats shows the statistics by number of clusters ($k$) with standard errors drawn with vertical segments and the optimal value of $k$ marked with a vertical dashed blue line. According to the
clusGap
help,Other methods behave similarly. This criterion does not cause any of the gap statistics to stand out, resulting in an estimate of $k=1$.
Choice of scale depends on the application, but a reasonable default starting point is a measure of dispersion of the data, such as the MAD or standard deviation. This plot repeats the analysis after recentering to zero and rescaling to make a unit standard deviation for each component $a$ and $b$:
The $k=2$ K-means solution is indicated by varying symbol type and color in the scatterplot of the data at left. Among the set $k\in\{1,2,3,4,5\}$, $k=2$ is clearly favored in the gap statistics plot at right: it is the first local maximum and the stats for smaller $k$ (that is, $k=1$) are significantly lower. Larger values of $k$ are likely overfit for such a small dataset, and none are significantly better than $k=2$. They are shown here only to illustrate the general method.
Here is
R
code to produce these figures. The data approximately match those shown in the question.