As far as I know, k-means picks the initial centers randomly. Since they are chosen by pure luck, they can be chosen really badly. The k-means++ algorithm tries to solve this problem by spreading the initial centers apart.
-
Do the two algorithms guarantee the same results? Or is it possible that poorly chosen initial centroids lead to a bad result, no matter how many iterations are run?
-
Let's say there is a given dataset and a given number of desired clusters. We run a k-means algorithm until it converges (the centers no longer move). Is there an exact solution to this clustering problem (in terms of SSE), or will k-means sometimes produce different results on reruns?
-
If there is more than one solution to a clustering problem (given dataset, given number of clusters), does k-means++ guarantee a better result, or just a faster one? By better I mean lower SSE.
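To be precise, by SSE I mean the usual within-cluster sum of squared errors that k-means minimizes:

$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$

where $C_j$ is the $j$-th cluster and $\mu_j$ its centroid.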
The reason I am asking these questions is that I am hunting for a k-means algorithm to cluster a huge dataset. I have found some k-means++ implementations, but there are some CUDA implementations too. As you already know, CUDA uses the GPU and can run many hundreds of threads in parallel, so it can really speed up the whole process. But none of the CUDA implementations I have found so far use k-means++ initialization.
Best Answer
K-means starts by placing all cluster centers randomly and then looks for "better" solutions. K-means++ starts by placing one cluster center randomly and then chooses each further center with probability proportional to its squared distance from the nearest center already chosen, which tends to spread the centers apart. So both algorithms use random initialization as a starting point and can give different results on different runs. As an example you can check the lecture Clustering As An Example Inference Problem; around the 40th minute there are examples of k-means runs, but the whole lecture is interesting.
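For reference, here is a minimal NumPy sketch of that seeding rule (each new center is drawn with probability proportional to the squared distance to its nearest already-chosen center); function and parameter names are my own:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """k-means++ seeding: first center uniform at random, each further
    center drawn with probability proportional to the squared distance
    to its nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center.
        d2 = np.min(((X[:, None] - np.array(centers)) ** 2).sum(-1), axis=1)
        # Sample the next center with probability proportional to d2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Points far from every existing center get a high probability, which is exactly why the seeds end up spread apart instead of clumped together.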
So, answering your questions:

1. No. Both algorithms are randomized, so different runs can produce different results, and poorly chosen initial centroids can leave plain k-means stuck in a bad local minimum no matter how many iterations you run.
2. K-means only converges to a local minimum of the SSE, not necessarily the global one (finding the globally optimal clustering is NP-hard in general), so reruns with different initializations can produce different results.
3. K-means++ does not guarantee a lower SSE on any single run, but its seeding gives an expected O(log k) approximation to the optimal SSE (Arthur & Vassilvitskii, 2007), and in practice it usually improves both quality and speed.
As for your problem: what k-means++ does is choose the initial centers and then run "classic" k-means. So what you can do is (1) use the part of the algorithm that chooses the centers and then (2) feed those centers into a GPU implementation of k-means. This way at least part of the problem is handled by GPU-based software, so it should be faster.
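A sketch of that split, assuming scikit-learn is available: its `kmeans_plusplus` helper does step (1), and its plain `KMeans` stands in here for the CUDA k-means in step (2). Any GPU implementation that accepts initial centers could take the place of the second call:

```python
import numpy as np
from sklearn.cluster import KMeans, kmeans_plusplus

# Toy data: two well-separated blobs (stand-in for the huge dataset).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])

# Step (1): k-means++ seeding, done on the CPU.
centers, _ = kmeans_plusplus(X, n_clusters=2, random_state=0)

# Step (2): hand the seeds to any k-means that accepts initial centers.
# KMeans with init=<array> runs only the Lloyd iterations from those seeds,
# which is the part you would replace with a CUDA implementation.
km = KMeans(n_clusters=2, init=centers, n_init=1).fit(X)
```

Since the seeding pass is a single sweep over the data per center, it is cheap compared to the iterative clustering, so doing it on the CPU costs little while the expensive part still runs on the GPU.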