Solved – In k-means, can two initial random centeroids be same

clusteringk-means

The k-means algorithm does the following:

  1. Given a set of points, we first choose k random points to be the initial centroids.

  2. We then create k clusters. The ith cluster contains the points nearest to the ith centroid.

  3. Now, we calculate the means of points in each cluster. Thus we have k means. These k means form the updated k centroids. We then create new k clusters as described in step 2.

  4. We keep doing steps 2. and 3. until the k centroids stop changing positions

My question is – what happens if two or more of the initial randomly chosen centroids turn out to be the same (I'm talking about the centroids chosen randomly in the first step)?

My first intuition is that we would just proceed as it is, without any changes to the algorithm.

Is this how that works? Or is there a guarantee that the initial centers are never going to be the same?

Best Answer

Yes they can be the same.

Assume a worst case: we have n > k identical points.

Then all centers must be the same.

The problem is that k-means itself cannot adjust k if such things happen.

In reality, it usually will not be a problem. Even if you have two identical centers in the beginning, one will usually change after the first iteration, and then they will move into different directions.

Beware that in some situations, k-means clusters can become empty, too. It appears the most stable approach then is to keep the previous iterations mean for these clusters.