Here is my code, in Mathematica:
data = {{7, 1}, {3, 4}, {1, 5}, {5, 8}, {1, 3}, {7, 8}, {8, 2}, {5, 9}, {8, 0}};
centers = {};
(* start with equal weights for every data point *)
RelativeWeights = Table[1/Length[data], {Length[data]}];
Table[
 (* choose one point according to the current weights; it becomes a new center *)
 centers =
  Union[RandomChoice[RelativeWeights -> data, 1], centers];
 data = Complement[data, centers];
 (* reweight each remaining point by its squared distance to its nearest center *)
 RelativeWeights =
  Normalize[EuclideanDistance[#, First[Nearest[centers, #]]]^2 & /@ data, Total];
 {centers, data},
 {3}] // TableForm
Here is how it works:
- The data set is defined
- We start with no centers
- The RelativeWeights (which govern the probability that a current data point will be selected to be a new, additional center) are initially set to be equal for all the current data points. Incidentally, Length[data] is merely the number of elements in the list called data.
- Now, Table is an iterator (like a DO statement), which here runs through the algorithm k times, where I set k = 3
- RandomChoice chooses 1 member of data according to the RelativeWeights; the chosen point is added to the list of centers and removed from data (a short example of weighted RandomChoice follows this list)
- The RelativeWeights are updated by taking each remaining element of data in turn, finding the Nearest element in the current list of centers, and computing the squared EuclideanDistance to that center. Your problem stated "distance," but the Wikipedia page for the algorithm uses the distance squared. (The weights are then normalized to sum to 1, making a true discrete distribution, but strictly this step is not needed, since RandomChoice accepts unnormalized weights and normalizes them itself.)
- Then we store the current list of centers and data.
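For concreteness, here is a minimal sketch (the weights and points are made up for illustration) of the weighted selection that RandomChoice performs:

weights = {0.1, 0.2, 0.7};           (* hypothetical weights summing to 1 *)
points = {{7, 1}, {3, 4}, {1, 5}};   (* hypothetical candidate points *)
RandomChoice[weights -> points, 1]   (* returns {{1, 5}} with probability 0.7 *)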
$\left(
\begin{array}{cc}
\left(
\begin{array}{cc}
1 & 5 \\
\end{array}
\right) & \left(
\begin{array}{cc}
1 & 3 \\
3 & 4 \\
5 & 8 \\
5 & 9 \\
7 & 1 \\
7 & 8 \\
8 & 0 \\
8 & 2 \\
\end{array}
\right) \\
\left(
\begin{array}{cc}
1 & 5 \\
7 & 8 \\
\end{array}
\right) & \left(
\begin{array}{cc}
1 & 3 \\
3 & 4 \\
5 & 8 \\
5 & 9 \\
7 & 1 \\
8 & 0 \\
8 & 2 \\
\end{array}
\right) \\
\left(
\begin{array}{cc}
1 & 3 \\
1 & 5 \\
7 & 8 \\
\end{array}
\right) & \left(
\begin{array}{cc}
3 & 4 \\
5 & 8 \\
5 & 9 \\
7 & 1 \\
8 & 0 \\
8 & 2 \\
\end{array}
\right) \\
\end{array}
\right)$
As you can see, the first data point chosen to be a center was (1,5), and the other points remained. Next, the point (7,8) was chosen and added to the list of centers, and so on.
The precise mathematics behind the third step is straightforward: For each point in data, find its distance to the nearest center, $d_i$. If there are $r$ elements currently in data, then you have $r$ distances, one for each point. The overall goal of kmeans++ is to choose new centers from data that are FAR from the existing centers, so we want to increase the probability of being chosen for points in data that are far from any center.
We do this as follows: We sum all $r$ squared distances (squared, to match both the code above and the Wikipedia description) to get $s_{tot}$:
$s_{tot} = \sum_{i=1}^r d_i^2$ .
For each point in data, we compute its squared distance divided by $s_{tot}$ and set that to be the probability that the point will be chosen as a new, additional center:
$p_i = d_i^2/s_{tot}$.
Notice that the sum of all $r$ probabilities adds up to 1.0, as is required for a true probability distribution.
Now, we want to choose a new point in data proportional to its probability $p_i$. Because of how we computed $p_i$, points far from any center (i.e., with large $p_i$) are more likely to be chosen than points with small $p_i$, just as the algorithm intends.
You can implement such probability-proportional selection by dividing the unit interval ($0 \rightarrow 1$) into segments of length $p_i$, uniformly choosing a value between 0 and 1, and finding which segment (i.e., which point) the randomly selected value lands in.
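As a minimal sketch (the squared distances here are made up for illustration), the segments can be built with a cumulative sum; this is exactly the selection that RandomChoice[RelativeWeights -> data, 1] performs in one step in the code above:

d2 = {16., 4., 25., 9.};      (* hypothetical squared distances d_i^2 *)
p = d2/Total[d2];             (* p_i = d_i^2/s_tot *)
cumulative = Accumulate[p];   (* right endpoints of the segments in [0, 1] *)
r = RandomReal[];             (* uniform value between 0 and 1 *)
LengthWhile[cumulative, # < r &] + 1   (* index of the segment that r lands in *)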
This is a full, precise mathematical explanation of kmeans++. I cannot see that anything more is needed to describe it, especially since the working code is given above for everyone to inspect.
In the kmeans help you can read that there is a centers argument that is

"either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres."
So, knowing that kmeans starts by allocating cluster centers randomly, you can (1) choose some random centers by hand, (2) start the algorithm with iter.max=1 (a single iteration) and save the results, and (3) start it again with the previous output as centers. Then you repeat (2) and (3) until convergence, and you have your data (a sketch of this loop follows).
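A minimal R sketch of this loop, under assumed names (x is your data matrix, k the number of clusters); note that kmeans warns it "did not converge in 1 iteration", which is expected here:

x <- matrix(rnorm(200), ncol = 2)    # illustrative data
k <- 3
centers <- x[sample(nrow(x), k), ]   # (1) choose random starting centers by hand
repeat {
  fit <- suppressWarnings(kmeans(x, centers = centers, iter.max = 1))  # (2) one iteration
  print(fit$centers)                 # record the centers at this step
  if (isTRUE(all.equal(fit$centers, centers))) break   # stop at convergence
  centers <- fit$centers             # (3) restart from the previous output
}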
Generally there is no point in recording the initial values since they are random, so they are not recorded.
Best Answer
It is simple to implement step 3. Assume that you have an array $D2$ that contains the squared distances computed in step 2. Now, we want to choose one of the points based on the weighted probability distribution given by $D2$. To do that, first normalize $D2$, that is, divide each element by the sum of $D2$. Then, generate a uniform random number $r$ in $[0,1]$. Finally, choose the first element of the list for which the cumulative sum of the normalized $D2$ up to and including that element is greater than or equal to $r$. For example, if the normalized $D2$ is $[0.2,0.1,0.4,0.3]$ and $r$ is 0.55, the third element is selected (the cumulative sums are $[0.2,0.3,0.7,1.0]$, and 0.7 is the first to reach 0.55).
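As a quick check of this recipe in Mathematica (using the example values above), FirstPosition finds the first cumulative sum reaching $r$:

d2norm = {0.2, 0.1, 0.4, 0.3};   (* normalized D2 from the example *)
r = 0.55;
First@FirstPosition[Accumulate[d2norm], _?(# >= r &)]   (* returns 3: the third element *)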