Here is my code, in Mathematica:
data = {{7, 1}, {3, 4}, {1, 5}, {5, 8}, {1, 3}, {7, 8}, {8, 2}, {5, 9}, {8, 0}};
centers = {};
(* start with equal weights for every data point *)
RelativeWeights = Table[1/Length[data], {Length[data]}];
Table[
 (* choose one point according to the current weights; it becomes a new center *)
 centers =
  Union[RandomChoice[RelativeWeights -> data, 1], centers];
 data = Complement[data, centers];
 (* reweight each remaining point by its squared distance to its nearest center *)
 RelativeWeights =
  Normalize[EuclideanDistance[#, First[Nearest[centers, #]]]^2 & /@ data, Total];
 {centers, data},
 {3}] // TableForm
Here is how it works:
- The data set is defined
- We start with no centers
- The RelativeWeights (which govern the probability that a current data point will be selected to be a new, additional center) are initially set to be equal for all the current data points. Incidentally, Length[data] is merely the number of elements in the list called data.
- Now, Table is an iterator (like a DO statement), which here runs through the algorithm k times, where I set k = 3
- RandomChoice chooses 1 member of data according to the RelativeWeights; the chosen point is added to the list of centers and removed from data (a short example of weighted RandomChoice follows this list)
- The RelativeWeights are updated by taking each remaining element of data in turn, finding the Nearest element in the current list of centers, and computing the squared EuclideanDistance to that center. Your problem stated "distance," but the Wikipedia page for the algorithm uses the distance squared. (The weights are then normalized to sum to 1, making a true discrete distribution, but strictly this step is not needed, since RandomChoice accepts unnormalized weights and normalizes them itself.)
- Then we store the current list of centers and data.
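For concreteness, here is a minimal sketch (the weights and points are made up for illustration) of the weighted selection that RandomChoice performs:

weights = {0.1, 0.2, 0.7};           (* hypothetical weights summing to 1 *)
points = {{7, 1}, {3, 4}, {1, 5}};   (* hypothetical candidate points *)
RandomChoice[weights -> points, 1]   (* returns {{1, 5}} with probability 0.7 *)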
$\left(
\begin{array}{cc}
\left(
\begin{array}{cc}
1 & 5 \\
\end{array}
\right) & \left(
\begin{array}{cc}
1 & 3 \\
3 & 4 \\
5 & 8 \\
5 & 9 \\
7 & 1 \\
7 & 8 \\
8 & 0 \\
8 & 2 \\
\end{array}
\right) \\
\left(
\begin{array}{cc}
1 & 5 \\
7 & 8 \\
\end{array}
\right) & \left(
\begin{array}{cc}
1 & 3 \\
3 & 4 \\
5 & 8 \\
5 & 9 \\
7 & 1 \\
8 & 0 \\
8 & 2 \\
\end{array}
\right) \\
\left(
\begin{array}{cc}
1 & 3 \\
1 & 5 \\
7 & 8 \\
\end{array}
\right) & \left(
\begin{array}{cc}
3 & 4 \\
5 & 8 \\
5 & 9 \\
7 & 1 \\
8 & 0 \\
8 & 2 \\
\end{array}
\right) \\
\end{array}
\right)$
As you can see, the first data point chosen to be a center was (1,5), and the other points remained. Next, the point (7,8) was chosen and added to the list of centers, and so on.
The precise mathematics behind the third step is straightforward: For each point in data, find its distance to the nearest center, $d_i$. If there are $r$ elements currently in data, then you have $r$ distances, one for each point. The overall goal of kmeans++ is to choose new centers from data that are FAR from the existing centers, so we want to increase the probability of being chosen for points in data that are far from any center.
We do this as follows: We sum all $r$ squared distances (squared, to match both the code above and the Wikipedia description) to get $s_{tot}$:
$s_{tot} = \sum_{i=1}^r d_i^2$ .
For each point in data, we compute its squared distance divided by $s_{tot}$ and set that to be the probability that the point will be chosen as a new, additional center:
$p_i = d_i^2/s_{tot}$.
Notice that the sum of all $r$ probabilities adds up to 1.0, as is required for a true probability distribution.
Now, we want to choose a new point in data proportional to its probability $p_i$. Because of how we computed $p_i$, points far from any center (i.e., with large $p_i$) are more likely to be chosen than points with small $p_i$, just as the algorithm intends.
You can implement such probability-proportional selection by dividing the unit interval ($0 \rightarrow 1$) into segments of length $p_i$, uniformly choosing a value between 0 and 1, and finding which segment (i.e., which point) the randomly selected value lands in.
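As a minimal sketch (the squared distances here are made up for illustration), the segments can be built with a cumulative sum; this is exactly the selection that RandomChoice[RelativeWeights -> data, 1] performs in one step in the code above:

d2 = {16., 4., 25., 9.};      (* hypothetical squared distances d_i^2 *)
p = d2/Total[d2];             (* p_i = d_i^2/s_tot *)
cumulative = Accumulate[p];   (* right endpoints of the segments in [0, 1] *)
r = RandomReal[];             (* uniform value between 0 and 1 *)
LengthWhile[cumulative, # < r &] + 1   (* index of the segment that r lands in *)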
This is a full, precise mathematical explanation of kmeans++. I cannot see that anything more is needed to describe it, especially since the working code is given above for everyone to inspect.
In the kmeans help you can read that there is a centers argument that is

"either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres."
So, knowing that kmeans starts by allocating cluster centers randomly, you can (1) choose some random centers by hand, (2) start the algorithm with iter.max=1 (a single iteration) and save the results, and (3) start it again with the previous output as centers. Then you repeat (2) and (3) until convergence, and you have your data (a sketch of this loop follows).
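A minimal R sketch of this loop, under assumed names (x is your data matrix, k the number of clusters); note that kmeans warns it "did not converge in 1 iteration", which is expected here:

x <- matrix(rnorm(200), ncol = 2)    # illustrative data
k <- 3
centers <- x[sample(nrow(x), k), ]   # (1) choose random starting centers by hand
repeat {
  fit <- suppressWarnings(kmeans(x, centers = centers, iter.max = 1))  # (2) one iteration
  print(fit$centers)                 # record the centers at this step
  if (isTRUE(all.equal(fit$centers, centers))) break   # stop at convergence
  centers <- fit$centers             # (3) restart from the previous output
}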
Generally there is no point in recording the initial values since they are random, so they are not recorded.
Best Answer
It is simple to implement step 3. Assume that you have an array $D2$ that contains the squared distances computed in step 2. Now, we want to choose one of the points based on the weighted probability distribution given by $D2$. To do that, first normalize $D2$, that is, divide each element by the sum of $D2$. Then, generate a uniform random number $r$ in $[0,1]$. Finally, choose the first element of the list for which the cumulative sum of the normalized $D2$ up to and including that element is greater than or equal to $r$. For example, if the normalized $D2$ is $[0.2,0.1,0.4,0.3]$ and $r$ is 0.55, the third element is selected (the cumulative sums are $[0.2,0.3,0.7,1.0]$, and 0.7 is the first to reach 0.55).
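As a quick check of this recipe in Mathematica (using the example values above), FirstPosition finds the first cumulative sum reaching $r$:

d2norm = {0.2, 0.1, 0.4, 0.3};   (* normalized D2 from the example *)
r = 0.55;
First@FirstPosition[Accumulate[d2norm], _?(# >= r &)]   (* returns 3: the third element *)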