Solved – k-means vs k-means++

k-means

As far as I know, k-means picks the initial centers randomly. Since they are chosen by pure luck, they can be selected really badly. The k-means++ algorithm tries to solve this problem by spreading the initial centers more evenly.

  • Do the two algorithms guarantee the same results? Or is it possible that poorly chosen initial centroids lead to a bad result, no matter how many iterations are run?

  • Let's say there is a given dataset and a given number of desired clusters. We run a k-means algorithm until it converges (the centers no longer move). Is there an exact solution to this clustering problem (in terms of SSE), or will k-means sometimes produce a different result on a rerun?

  • If there is more than one solution to a clustering problem (given dataset, given number of clusters), does k-means++ guarantee a better result, or just a faster one? By better I mean lower SSE.

The reason I am asking these questions is that I am on the hunt for a k-means algorithm for clustering a huge dataset. I have found some k-means++ implementations, but there are some CUDA implementations too. As you may know, CUDA uses the GPU and can run many hundreds of threads in parallel (so it can really speed up the whole process). But none of the CUDA implementations I have found so far include k-means++ initialization.

Best Answer

K-means starts by allocating all cluster centers randomly and then looks for "better" solutions. K-means++ starts by allocating one cluster center randomly and then chooses each remaining center based on the centers already picked. So both algorithms use random initialization as a starting point and can give different results on different runs. As an example, you can check this lecture: Clustering As An Example Inference Problem; around the 40th minute there are examples of k-means runs, but the whole lecture is interesting.
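To make "searches for other centers given the first one" concrete, here is a minimal NumPy sketch of the k-means++ seeding step (the function name `kmeans_pp_seed` is my own): each new center is drawn with probability proportional to its squared distance to the nearest center chosen so far.

```python
import numpy as np

def kmeans_pp_seed(X, k, rng=None):
    """k-means++ seeding: pick the first center uniformly at random,
    then pick each further center with probability proportional to its
    squared distance (D^2) to the nearest center chosen so far."""
    rng = np.random.default_rng(rng)
    n = len(X)
    idx = [rng.integers(n)]                    # first center: uniform
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1).min(1)
        probs = d2 / d2.sum()                  # D^2 weighting
        idx.append(rng.choice(n, p=probs))
    return X[idx]
```

Note that points already chosen as centers have distance 0 and thus probability 0 of being picked again, which is what pushes the centers apart.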

So, answering your questions:

  • No. Because of the random initialization, different runs can give different results (see the examples in the lecture). The two algorithms should give comparable results, but this is not guaranteed. Also, since all centers are initialized randomly in k-means, it can give different results than k-means++.
  • K-means can give different results on different runs, since it may converge to different local optima depending on the initialization.
  • The k-means++ paper provides Monte Carlo simulation results showing that k-means++ is both faster and achieves better performance (lower SSE) on average, so there is no guarantee for any particular run, but it tends to be better.

As for your problem: what k-means++ does is choose the initial centers and then run "classic" k-means. So what you can do is (1) use only the seeding part of the algorithm to choose the centers and then (2) feed those centers into a GPU implementation of k-means. This way at least part of the problem is handled by the GPU-based software, so it should be faster.
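Assuming scikit-learn is available, this split is easy to sketch: its standalone `kmeans_plusplus` function does step (1) on the CPU, and the resulting centers can be handed to any k-means solver that accepts explicit initial centers (shown here with scikit-learn's own `KMeans` standing in for a GPU implementation).

```python
import numpy as np
from sklearn.cluster import KMeans, kmeans_plusplus

rng = np.random.default_rng(0)
# toy data: two blobs; in practice this would be the huge dataset
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])

# (1) k-means++ seeding only, on the CPU
centers, _ = kmeans_plusplus(X, n_clusters=2, random_state=0)

# (2) hand the seeded centers to the k-means solver
# (a GPU implementation would accept them the same way)
km = KMeans(n_clusters=2, init=centers, n_init=1).fit(X)
print(km.inertia_)  # final SSE
```

Seeding touches each point only k times, so doing it on the CPU is usually cheap compared to the Lloyd iterations that the GPU then takes over.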