Solved – Why is the k-means++ algorithm probabilistic?

clustering, k-means, probability

The k-means++ algorithm provides a technique to choose the initial $k$ seeds for the k-means algorithm. It does this by sampling each subsequent center from a multinomial distribution over the not-yet-chosen points, where the probability of a point $x$ being chosen as the next center is proportional to $D(x)^2$, with $D(x)$ the distance from $x$ to its nearest already-chosen center.
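For concreteness, here is a minimal sketch of that seeding step in Python (assuming NumPy; the function name `kmeans_pp_init` and its signature are my own for illustration, not from any particular library):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """D^2 seeding sketch: pick k initial centers from the rows of X."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    for _ in range(k - 1):
        # D(x)^2: squared distance of each point to its nearest chosen center
        d2 = np.min(
            ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1),
            axis=1,
        )
        # sample the next center with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```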

The point with the largest distance has the greatest probability of being chosen, but why can I not choose this point every time? What advantage do I gain by being 'fuzzy' with my seed selection?

Best Answer

You gain theoretical guarantees on the solution: for k-means initialized this way, the expected cost is within a known factor, $O(\log k)$, of the optimal k-means cost, cf. these slides for example.
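For reference, the guarantee proven in the original k-means++ paper (Arthur and Vassilvitskii, 2007) can be written as follows, with $\phi$ the k-means cost of the clustering obtained from the $D(x)^2$-sampled seeds and $\phi_{OPT}$ the optimal cost:

$$\mathbb{E}[\phi] \le 8(\ln k + 2)\,\phi_{OPT}$$

Since Lloyd's iterations can only decrease the cost, the bound established for the seeds carries over to the final solution.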

With the method you mention (deterministically taking the farthest point, which was used in the earlier literature), you can construct configurations where it is guaranteed to behave badly, precisely because it is deterministic: think of a single point sitting far, far away, e.g. on a separating hyperplane. The farthest-point rule will spend a center on it every single time, whereas the randomized $D(x)^2$ sampling only does so with some probability.
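To make the failure mode concrete, here is a small, hypothetical comparison (again assuming NumPy; the data set and function name are mine) showing how the deterministic farthest-point rule reacts to a single extreme outlier, in contrast to the $D(x)^2$ sampling sketched above:

```python
import numpy as np

def farthest_point_init(X, k, rng=None):
    """Deterministic variant: after a random first seed, always take the
    point with the largest distance to its nearest chosen center."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(X.shape[0])]]
    for _ in range(k - 1):
        d2 = np.min(
            ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1),
            axis=1,
        )
        centers.append(X[np.argmax(d2)])  # always the single farthest point
    return np.array(centers)

# Two tight clusters plus one extreme outlier.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.1, size=(50, 2)),
    rng.normal([5, 0], 0.1, size=(50, 2)),
    [[0.0, 100.0]],  # the outlier
])

# Whenever the first seed lands in one of the clusters (100 times out of 101),
# the deterministic rule is forced to place the second seed on the outlier;
# D^2 sampling picks it only with some probability.
print(farthest_point_init(X, k=2, rng=1)[-1])
```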