Solved – Probability for selecting centroids – K-means++

clusteringk-meansprobabilityweighted-sampling

K-means++ selects centroids one by one, where each point has the chance to become next centroid with probability proportional to distance to closest centroid already selected.

I implemented it like this (for selecting one centroid):

  • Calculate distance for each point to existing centroids and save distance to closest centroid
  • Divide distance to closest centroid by total distance in cluster of that centroid (sum of distances from each point in cluster to centroid)
  • Sort distances for all points (in this step we consider that given distances that were divided by total distance represent probability for selecting the point as centroid)
  • Create an array of cummulative probabilities (for example, array 0.2, 0.3, 0.5 gives array 0.2, 0.5, 1)
  • Generate a random number between 0 and 1
  • New centroid is the point that is represented by the interval to which the generated number belongs

In this way third point from the example has the greatest probability being selected since it has the greatest interval.

Is this a good way to implement K-means++ like initialization?

The problem is that I'm not getting the expected result, so I'm not sure if I misunderstood the concept or I got something wrong with implementation.

The problem occurs with a relatively large dataset (5000 points) with 15 clusters. Concretely – it happens that multiple points from the same cluster, that are relatively close to each other, get selected as centroids.

I'm guessing that intervals get relatively small since there are 5000 points and then the concept of greater probabilities gets lost. For example, if there's 800 points in some cluster we get probabilities of about 1/800, so probability for one point could be for example 0.00125 and for the other one 0.00130, so it seems like I'm getting away from weighted probability and getting to uniform probability.

Am I missing something?

Best Answer

The paper (see section 2.2) suggests that you use the squared distance when computing probabilities. In fact, you can try distance$^{\ell}$ using any exponent $\ell$ greater than 1. It stands to reason that as $\ell$ increases, the likelihood of initial centroids being close together will go to zero.