Solved – KMeans++: Computing the Weighted Probability for KMeans Initialization

distributions, k-means, neural networks

There are times when I get fewer than K clusters from a K-Means algorithm I implemented in Java, so I searched for a solution. One thread suggested using KMeans++ for initialization, but I am stuck on how to implement the algorithm (from the Wikipedia article), especially the 3rd step. I am also wondering how to implement it on the lattice where I am running the K-Means algorithm.

This is the exact algorithm.

  1. Choose one center uniformly at random from among the data points.

  2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.

  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.

  4. Repeat Steps 2 and 3 until k centers have been chosen.

  5. Now that the initial centers have been chosen, proceed using standard k-means clustering.

I am a bit confused about how to implement it, especially step 3: choosing a point with probability proportional to D(x)^2.

Best Answer

Step 3 is simple to implement. Assume you have an array $D2$ that contains the squared distances computed in step 2. We want to choose one of the points according to the weighted probability distribution given by $D2$. To do that, first normalize $D2$, i.e., divide each element by the sum of $D2$. Then generate a uniform random number $r$ in $[0,1]$, and choose the first element for which the cumulative sum of the normalized $D2$ up to and including that element is greater than or equal to $r$. For example, if the normalized $D2$ is $[0.2, 0.1, 0.4, 0.3]$ and $r = 0.55$, the cumulative sums are $[0.2, 0.3, 0.7, 1.0]$, so the third element is selected.
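
Since the question mentions Java, here is a minimal sketch of that selection step under the assumptions above; the class and method names (`KMeansPlusPlusStep3`, `chooseNextCenter`) and the array `d2` are hypothetical, with `d2[i]` assumed to hold $D(x_i)^2$ from step 2.

```java
import java.util.Random;

public class KMeansPlusPlusStep3 {

    /**
     * Choose the index of the next center: point i is selected with
     * probability proportional to d2[i], the squared distance from point i
     * to its nearest already-chosen center (step 3 of k-means++).
     */
    static int chooseNextCenter(double[] d2, Random rng) {
        // Sum of all squared distances, used to normalize the weights.
        double sum = 0.0;
        for (double v : d2) {
            sum += v;
        }

        double r = rng.nextDouble();   // uniform random number in [0, 1)
        double cumulative = 0.0;
        for (int i = 0; i < d2.length; i++) {
            cumulative += d2[i] / sum; // running sum of normalized weights
            // Take the first index whose cumulative weight reaches r;
            // points with zero weight (already-chosen centers) are skipped.
            if (cumulative >= r && d2[i] > 0) {
                return i;
            }
        }

        // Fallback for floating-point round-off: return the last point
        // that still has positive weight.
        for (int i = d2.length - 1; i >= 0; i--) {
            if (d2[i] > 0) {
                return i;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Same weights as the example in the answer: normalized they are
        // [0.2, 0.1, 0.4, 0.3], so index 2 is the most likely choice.
        double[] d2 = {2.0, 1.0, 4.0, 3.0};
        int idx = chooseNextCenter(d2, new Random());
        System.out.println("Chosen index: " + idx);
    }
}
```

You would call `chooseNextCenter` once per iteration of step 3, recomputing `d2` after each new center is added, until k centers have been chosen.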
