Solved – K-means cluster sizes change quite a bit on each run

clustering, k-means

I am running k-means on a sample of 1000 data points. The data are scaled (z-scores).

When I run kmeans(df, nstart=25, centers=5), it runs and I can get the size of each cluster. The largest cluster has 620 points in it.

I ran it again (I did not mean to), but noticed the largest cluster then had 450 in it.

Out of interest, I kept trying and got the largest cluster to be anywhere between 450 and 800.

Now I know that the initialization can affect the clusters and cluster sizes. HOWEVER, I am surprised it varies so much. If my cluster sizes changed by 10-20, then I would get it. But these are large swings.

Of course I can use set.seed() to stop it, but it seems too odd to get such wide variation in the results.

It seems too random that I might get the largest cluster of 450 or 600 or 775 based on when I ran it (or where it initialized). I feel uncomfortable with the results now.

Any thoughts or explanations?

Best Answer

This indicates that k-means might not be a good clustering algorithm for your data.

Most likely, you don't have well-separated clusters of similar size and weight, which is the situation where k-means works well. K-means makes some pretty strong assumptions about the data, and when these do not hold, the result can become as bad as any other random convex partitioning.

For example, k-means is known to suffer from outliers. Outliers can cause clusters to degenerate to a few outliers only, which effectively reduces your k. If this happens, the results of k=i and k=i+1 can become nearly identical.
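As a rough illustration of a cluster being taken over by outliers, here is a minimal sketch on made-up data (one Gaussian blob plus three extreme points; none of this comes from your data set). On data like this, one of the three clusters will typically hold only the outliers, so the blob is effectively split with k=2.

```r
# Hypothetical toy data: one Gaussian blob plus three far-away outliers.
set.seed(1)
blob     <- matrix(rnorm(400), ncol = 2)                 # 200 ordinary points
outliers <- matrix(c(50, 51, 50, 50, 50, 51), ncol = 2)  # (50,50), (51,50), (50,51)
x <- rbind(blob, outliers)

# Same call style as in the question, but with k = 3.
fit3 <- kmeans(x, centers = 3, nstart = 25)
fit3$size  # look for a cluster whose size matches the number of outliers

# Which cluster(s) did the outliers (the last three rows) land in?
outlier_clusters <- fit3$cluster[201:203]
table(outlier_clusters)
```

If the outliers sit alone in one cluster, only two centers are left to describe the rest of the data.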

In fact, I recommend that people run k-means several times; if they observe high variance in the results, k-means was a bad choice. Also, visualize your results. Choose some two-dimensional projection (e.g. onto your principal components, or even a series of scatter plots) and try to visually inspect your result for usefulness. If it looks useless to you, it probably is.

Can you update your question with such projections of your data, in particular for two very different results?
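The check described above (repeat the run, then plot two very different results) could be sketched roughly as follows. The matrix `df` here is a randomly generated stand-in for your scaled data, and the number of repeats (20) is an arbitrary choice; substitute your own data and settings.

```r
# Hypothetical stand-in for the scaled data `df` from the question.
set.seed(42)
df <- scale(matrix(rnorm(1000 * 4), ncol = 4))

# Refit k-means under different seeds and record the largest cluster size.
fits <- lapply(1:20, function(s) {
  set.seed(s)
  kmeans(df, centers = 5, nstart = 25)
})
largest <- sapply(fits, function(f) max(f$size))
range(largest)  # a wide range suggests an unstable partition

# Project onto the first two principal components and plot the two runs
# with the smallest and the largest "largest cluster" side by side.
pca <- prcomp(df)
op <- par(mfrow = c(1, 2))
for (i in c(which.min(largest), which.max(largest))) {
  plot(pca$x[, 1:2], col = fits[[i]]$cluster, pch = 19,
       main = paste("Run", i, "largest cluster:", largest[i]))
}
par(op)
```

If the two panels show very different partitions of what looks like one undifferentiated cloud, that supports the suspicion that the data have no cluster structure k-means can recover.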
