Solved – Why is the k-means++ algorithm probabilistic?

clustering, k-means, probability

The k-means++ algorithm provides a technique to choose the initial $k$ seeds for the k-means algorithm. It does this by sampling each subsequent center from a multinomial distribution over the not-yet-chosen points, where the probability of a point $x$ being chosen as the next center is proportional to $D(x)^2$, with $D(x)$ the distance from $x$ to its nearest already-chosen center.
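For concreteness, here is a minimal sketch of that seeding step in Python (assuming NumPy; the function name `kmeans_pp_init` and its signature are my own for illustration, not from any particular library):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """D^2 seeding sketch: pick k initial centers from the rows of X."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    for _ in range(k - 1):
        # D(x)^2: squared distance of each point to its nearest chosen center
        d2 = np.min(
            ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1),
            axis=1,
        )
        # sample the next center with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```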

The point with the largest distance has the greatest probability of being chosen, but why can I not choose this point every time? What advantage do I gain by being 'fuzzy' with my seed selection?

Best Answer

You gain theoretical guarantees on the solution: for k-means initialized this way, the expected cost is within a known factor, $O(\log k)$, of the optimal k-means cost, cf. these slides for example.
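For reference, the guarantee proven in the original k-means++ paper (Arthur and Vassilvitskii, 2007) can be written as follows, with $\phi$ the k-means cost of the clustering obtained from the $D(x)^2$-sampled seeds and $\phi_{OPT}$ the optimal cost:

$$\mathbb{E}[\phi] \le 8(\ln k + 2)\,\phi_{OPT}$$

Since Lloyd's iterations can only decrease the cost, the bound established for the seeds carries over to the final solution.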

With the method you mention (deterministically taking the farthest point, which was used in the earlier literature), you can construct configurations where it is guaranteed to behave badly, precisely because it is deterministic: think of a single point sitting far, far away, e.g. on a separating hyperplane. The farthest-point rule will spend a center on it every single time, whereas the randomized $D(x)^2$ sampling only does so with some probability.
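To make the failure mode concrete, here is a small, hypothetical comparison (again assuming NumPy; the data set and function name are mine) showing how the deterministic farthest-point rule reacts to a single extreme outlier, in contrast to the $D(x)^2$ sampling sketched above:

```python
import numpy as np

def farthest_point_init(X, k, rng=None):
    """Deterministic variant: after a random first seed, always take the
    point with the largest distance to its nearest chosen center."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(X.shape[0])]]
    for _ in range(k - 1):
        d2 = np.min(
            ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1),
            axis=1,
        )
        centers.append(X[np.argmax(d2)])  # always the single farthest point
    return np.array(centers)

# Two tight clusters plus one extreme outlier.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.1, size=(50, 2)),
    rng.normal([5, 0], 0.1, size=(50, 2)),
    [[0.0, 100.0]],  # the outlier
])

# Whenever the first seed lands in one of the clusters (100 times out of 101),
# the deterministic rule is forced to place the second seed on the outlier;
# D^2 sampling picks it only with some probability.
print(farthest_point_init(X, k=2, rng=1)[-1])
```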