Solved – K-means in R, high nstart gives tiny clusters $(n=1)$

clustering, k-means, r

I am using kmeans() to cluster standardized scores from a factor analysis in R (20 variables, 919 cases).

As R uses random cases for the initial centroids, I was hoping that choosing a high value for $nstart$, such as 25 or 50, would help stabilize the solution.

However, this frequently (in many runs with 6 or 7 clusters and $nstart > 10$) results in a cluster with $n = 1$.

  • What could be the reason for this and how should I deal with it?
  • Is stabilizing the solution through a higher nstart generally a good
    idea?

Best Answer

Mechanistically, you can use R to help identify the appropriate number of clusters:

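# Plot the within-groups sum of squares for 1 to nc clusters
wssplot <- function(data, nc = 15, seed = 1234) {
  # With one cluster, the within-groups sum of squares is the total sum of squares
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)  # same seed for every k, so the plot is reproducible
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}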
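
Since the question is about nstart, here is a variant of the same elbow computation in which every k uses the best of several random starts. It is only a sketch on simulated data: the matrix `sim`, the range `k_values`, and nstart = 25 are illustrative assumptions, not something from the original post.

```r
# Simulated stand-in for the standardized factor scores (assumption)
set.seed(1234)
sim <- scale(matrix(rnorm(300 * 4), ncol = 4))

k_values <- 1:10

# For each k, keep the total within-cluster sum of squares of the
# best solution found among 25 random starts
wss <- sapply(k_values, function(k) {
  kmeans(sim, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

plot(k_values, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
```

A bend in the curve suggests a candidate number of clusters. On pure noise like `sim` there is no real bend, which is itself informative.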

Alternatively:

library(NbClust)

nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")
table(nc$Best.n[1,])

R in Action provides more details. It is generally recommended to use more than 25 initial configurations (that is, nstart > 25).
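
To see what a higher nstart buys in terms of stability, you can compare the total within-cluster sum of squares over repeated runs with one start and with many starts. This is a rough sketch on simulated data: `sim`, the helper `run_kmeans`, the choice of 6 clusters, and the 20 seeds are assumptions made for illustration.

```r
# Simulated stand-in data (assumption, not the asker's factor scores)
set.seed(1)
sim <- scale(matrix(rnorm(500 * 5), ncol = 5))

# Hypothetical helper: fit k-means under a given seed and nstart,
# and return the total within-cluster sum of squares
run_kmeans <- function(seed, nstart) {
  set.seed(seed)
  kmeans(sim, centers = 6, nstart = nstart, iter.max = 50)$tot.withinss
}

seeds  <- 1:20
single <- sapply(seeds, run_kmeans, nstart = 1)   # one random start per run
multi  <- sapply(seeds, run_kmeans, nstart = 25)  # best of 25 starts per run

range(single)  # spread of solutions with a single start
range(multi)   # spread of solutions with 25 starts
```

If the spread for `multi` is clearly narrower than for `single`, the extra starts are doing their job. A narrow spread does not by itself show that the chosen number of clusters is sensible.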

The cluster with a single case could arise either because you're only selecting one to start or because the data don't separate very well.
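
One way to follow up is to look at the cluster sizes and pull out the cases that end up alone, to check whether they are extreme values. The sketch below uses simulated data with one artificially extreme row. The names `scores` and `isolated_rows`, the planted outlier, and the choice of 4 clusters are all assumptions for illustration.

```r
# Simulated scores with one artificial extreme case (assumption)
set.seed(42)
scores <- matrix(rnorm(200 * 3), ncol = 3)
scores[1, ] <- c(10, 10, 10)

fit <- kmeans(scores, centers = 4, nstart = 25, iter.max = 50)

fit$size                                  # number of cases in each cluster

# Clusters containing exactly one case, and the rows that sit in them
singleton_clusters <- which(fit$size == 1)
isolated_rows <- which(fit$cluster %in% singleton_clusters)

isolated_rows                             # row indices of isolated cases
scores[isolated_rows, , drop = FALSE]     # inspect their values
```

If the isolated rows turn out to be extreme cases, that may also explain why a singleton shows up more often with a higher nstart: more starts give more chances that such a case is drawn as an initial centre, and isolating it can lower the total within-cluster sum of squares.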