Clustering – Why Elbow Method Is Not Giving a Proper Curve

Tags: clustering, k-means, r

I am trying to determine how many clusters to use for k-means clustering using several methods. The gap statistic gives k=4 and the silhouette method gives k=3. I have run k-means with both values, and both seem to give decent results, but I still do not know which is better. So I also tried the elbow method, hoping it would point to either 3 or 4, but the plot looks strange and I cannot determine what k should be from it. The total within-cluster sum of squares decreases up to k=4, then suddenly increases at k=5 and decreases again at k=6, creating a "peak" between k=4 and k=6.
[Figure: Elbow method plot]

I am using the function "fviz_nbclust()" from the package "factoextra" in R:

fviz_nbclust(dataset, kmeans, method = "wss")
fviz_nbclust(dataset, kmeans, method = "silhouette")
fviz_nbclust(dataset, kmeans, method = "gap_stat")

Any advice would be helpful, as I am fairly new to the subject of clustering and may have missed important or basic knowledge.

Best Answer

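By default, kmeans uses nstart=1, meaning it tries only one random set of initial centers. k-means converges to a local optimum that depends on those centers, so, depending on your data, a single run at a given k can end up with a total within-cluster sum of squares that is larger than the one found for a smaller k. For the same reason, it will not give you the same clusters from run to run.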

For example:

library(factoextra)
set.seed(675)
# 100 x 10 matrix of i.i.d. negative binomial counts
M <- matrix(rnbinom(1000, mu = 10, size = 1), ncol = 10)
fviz_nbclust(M, kmeans, method = "wss")

Increasing nstart, so that the best of several random starts is kept for each k, should make this go away:

fviz_nbclust(M, kmeans, nstart = 10, method = "wss")
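
The effect of a single random start can also be checked directly with base R's kmeans(), without factoextra. The sketch below regenerates the same M so that it runs on its own; k = 5, the 20 repetitions and nstart = 25 are arbitrary choices made for illustration only.

```r
set.seed(675)
M <- matrix(rnbinom(1000, mu = 10, size = 1), ncol = 10)

# Repeat a single-start k-means fit 20 times at k = 5 (both numbers are
# arbitrary, illustrative choices) and record the total within-cluster SS.
single_start <- replicate(20, kmeans(M, centers = 5, nstart = 1)$tot.withinss)

# One fit that keeps the best of 25 random starts.
multi_start <- kmeans(M, centers = 5, nstart = 25)$tot.withinss

range(single_start)  # spread caused purely by the random initial centers
multi_start          # usually at or near the low end of that range
```

If range(single_start) is wide, a single unlucky run at one value of k is enough to produce a bump in the elbow plot.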

It probably also makes sense to set nstart for the other methods (silhouette and gap statistic), since they run kmeans for each k as well.
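
As a rough sketch of what setting nstart for the other methods looks like outside fviz_nbclust(), the block below computes the average silhouette width and the gap statistic with the cluster package, passing nstart = 25 to kmeans in both. The data M, the range of k, nstart = 25 and B = 50 are illustrative assumptions, not values from the question, and this is not meant to reproduce factoextra's internals exactly.

```r
library(cluster)

set.seed(675)
M <- matrix(rnbinom(1000, mu = 10, size = 1), ncol = 10)
d <- dist(M)

# Average silhouette width for k = 2..8, each fit using 25 random starts.
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(M, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
k_sil <- ks[which.max(avg_sil)]

# Gap statistic; extra arguments (here nstart) are forwarded to kmeans.
gap <- clusGap(M, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 25,
               verbose = FALSE)
k_gap <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"])
```

With fviz_nbclust() itself, the same idea should amount to adding nstart to each call, as in the "wss" example above, e.g. fviz_nbclust(dataset, kmeans, nstart = 25, method = "silhouette").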