How random are the results of the k-means algorithm?

algorithms, clustering, k-means

I have a question regarding the k-means algorithm. I know k-means is a randomized algorithm, but how random is it, and what results can I expect? Suppose you have clustered a dataset into $4$ clusters, where each point has a label $1$, $2$, $3$, or $4$ (which tells you what cluster it belongs to). You then perform a second clustering on the same dataset with the same criteria.

  1. Will all the points that were in a specific cluster during the first clustering be in the same cluster the next time you apply the k-means algorithm?
  2. If not, will they most likely be in the same cluster, and is there some measure for this likelihood?

Judging by some output I received in R, I believe that 1. does not hold, as I get different cluster sizes for different runs on the same dataset.
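For reference, here is a minimal sketch of the kind of behaviour I mean (on simulated data, since the particular dataset should not matter):

```r
# Two runs of kmeans() on the same data: the random initial centers
# can lead to different partitions, and hence different cluster sizes.
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)  # hypothetical 100-point dataset

fit1 <- kmeans(x, centers = 4)
fit2 <- kmeans(x, centers = 4)

fit1$size  # cluster sizes from the first run
fit2$size  # may differ, since each run starts from different random centers
```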

All help is greatly appreciated!

Best Answer

There is more than one k-means algorithm.

You probably mean Lloyd's algorithm, which depends only on the initial cluster centers. But there is also MacQueen's, which depends on the sequence, i.e. the ordering, of the points. Then there are Hartigan and Wong's, Forgy's, ...
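As a side note, base R's `kmeans()` exposes several of these variants through its `algorithm` argument, which makes the differences easy to try out yourself (a sketch on simulated data):

```r
# The same data clustered with different k-means variants; they may
# converge to different partitions even for the same input.
set.seed(42)
x <- matrix(rnorm(200), ncol = 2)

fit_lloyd    <- kmeans(x, centers = 4, algorithm = "Lloyd")
fit_macqueen <- kmeans(x, centers = 4, algorithm = "MacQueen")
fit_hw       <- kmeans(x, centers = 4, algorithm = "Hartigan-Wong")  # R's default

fit_lloyd$size
fit_macqueen$size
fit_hw$size
```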

And of course, the various implementations may differ in details and optimizations. They may treat ties differently, too! For example, many naive implementations will always assign a tied point to the first (or last) cluster, while others preserve its current assignment. So when clustering integer-valued data, where ties are much more common, but also on the Iris data set, you may see artifacts and differences caused by this.

Furthermore, the cluster labels themselves are arbitrary: some implementations reorder the clusters by memory address after k-means finishes, so you cannot safely assume that cluster 1 remains cluster 1, even if k-means converged after the first iteration. Others reorder the clusters by size (which actually makes sense for k-means, as it makes different random initializations more likely to return the same labeling).
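Because of this label ambiguity, comparing two runs should be done on the partitions, not the raw labels. A simple way to do that in R is a contingency table (a sketch, again on simulated data):

```r
# Compare two runs via a contingency table: a permutation-matrix pattern
# means the partitions are identical, merely with renamed clusters.
set.seed(7)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

a <- kmeans(x, centers = 2)$cluster
b <- kmeans(x, centers = 2)$cluster

table(a, b)
```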

But assuming that they all iterate Lloyd's update until convergence (which the original MacQueen k-means did not!), they should all at least arrive at a local optimum. And there will only be oh-so-many local optima...
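A common way to reduce this randomness in practice, at extra cost, is to restart from several random initializations and keep the run with the lowest within-cluster sum of squares; in R, `kmeans()` does this via its `nstart` argument (a sketch):

```r
# 25 random restarts; kmeans() keeps the best run automatically.
set.seed(3)
x <- matrix(rnorm(400), ncol = 2)

fit <- kmeans(x, centers = 4, nstart = 25)
fit$tot.withinss  # objective value of the best of the 25 runs
```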

Consider, for example, the data set of $n$ points on the unit circle given by $p_j=(\sin(2\pi \frac{j}{n}), \cos(2\pi \frac{j}{n}))$ for $j = 1, \dots, n$, with $n$ divisible by $k$. It has a lot of locally optimal solutions, so running k-means with different random seeds will indeed give you very different results. For appropriate parameters, I believe the chance that two elements which were in the same cluster in one result are in the same cluster again in another result is somewhere around $50\%$. In higher dimensionality you can probably push this number even lower. For example, in the $n$-dimensional data set where $p_{jj}=1$ and $p_{ij}=0$ for $i\neq j$ (the standard basis vectors), all points are equidistant, and it is easy to see that this wreaks havoc on k-means...
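Here is a sketch of that circle example in R, estimating how often a pair of points that shared a cluster in one run still shares a cluster in another (the seeds and the choices $n = 120$, $k = 4$ are arbitrary assumptions of mine):

```r
# n points on the unit circle, clustered twice with k = 4.
n <- 120
j <- seq_len(n)
x <- cbind(sin(2 * pi * j / n), cos(2 * pi * j / n))

set.seed(1); a <- kmeans(x, centers = 4, iter.max = 100)$cluster
set.seed(2); b <- kmeans(x, centers = 4, iter.max = 100)$cluster

# For every pair of points, check co-membership in each run.
same_a <- outer(a, a, "==")
same_b <- outer(b, b, "==")
pairs  <- upper.tri(same_a)

# Fraction of pairs co-clustered in run 1 that are also co-clustered in run 2.
mean(same_b[pairs][same_a[pairs]])
```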
