Solved – Repeating k-means: is it helpful?

clustering, k-means

I'm working with the k-means algorithm, and I'm proceeding in this way:

  1. I ran k-means for 2 up to n clusters and plotted the within-cluster variance against the number of clusters to find the "elbow", i.e. the best trade-off between model fit and number of clusters; the best result is 4;
  2. I then repeated k-means many times, say 20,000, each time with 4 clusters;
  3. I stored the coordinates of all the computed centroids;
  4. I calculated the mean coordinates of each centroid, i.e. a mean taken over all 20,000 runs;
  5. With these new centroids, I assigned each point in my dataset to the centroid at the smallest Euclidean distance, to get the "best" centroid for each point.
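The steps above can be sketched in numpy; the toy dataset and the minimal `kmeans` helper are assumptions for illustration, not part of the question:

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Minimal Lloyd's k-means: returns (centroids, labels)."""
    # initialise centroids at k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its points (keep old if empty)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

rng = np.random.default_rng(0)
# hypothetical data: 4 well-separated Gaussian blobs
X = np.vstack([rng.normal(c, 0.3, size=(50, 2))
               for c in [(0, 0), (0, 5), (5, 0), (5, 5)]])

# steps 2-4: repeat k-means and average centroid coordinates row-by-row
runs = 200  # 20,000 in the question; reduced here for speed
all_centroids = np.stack([kmeans(X, 4, rng)[0] for _ in range(runs)])
mean_centroids = all_centroids.mean(axis=0)

# step 5: assign every point to its nearest "mean" centroid
d = np.linalg.norm(X[:, None, :] - mean_centroids[None, :, :], axis=2)
assigned = d.argmin(axis=1)
print("clusters actually used:", np.unique(assigned).size)
```

Because each run may label the same blobs in a different order, the row-by-row averages tend to drift toward the overall data mean, which is what produces unused centroids in step 5.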

The problem is that my points end up distributed over only 3 centroids, i.e. the fourth is assigned no points at all. This also contradicts point 1, and if I instead choose 5, 6, etc. clusters, there is always a cluster left unused.

I know this means that a centroid is too far away from my points, but is that interpretation correct? And is my procedure meaningful? Should I calculate the variance to assess the goodness of fit of the model?

Best Answer

Based on your comment, I see that you simply average, say, the cluster-1 centers across runs to get the mean cluster-1 center, i.e. $$\bar{C}_{1}=\frac{1}{n}\sum_{i=1}^{n} C_{1}^{(i)},\qquad n=20{,}000,$$ which makes your analysis invalid, if I'm not missing anything: even if the k-means algorithm finds exactly the same clusters in your data (i.e. exactly the same means) on two runs, the clusters may come out in a different order. Cluster labels carry no specific ordering, which in turn means that averaging centers label-by-label doesn't make much sense.
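A minimal numpy illustration of this label-switching problem (the two runs are hypothetical centroid matrices, one row per cluster label):

```python
import numpy as np

# Two k-means runs that found the *same* two clusters, but with the
# labels swapped: run B's cluster 0 is run A's cluster 1, and vice versa.
run_a = np.array([[0.0, 0.0], [10.0, 10.0]])
run_b = np.array([[10.0, 10.0], [0.0, 0.0]])

# averaging label-by-label collapses both centers onto the midpoint
mean_centers = (run_a + run_b) / 2
print(mean_centers)
# both rows become [5., 5.]: the two "mean centroids" coincide,
# even though both runs agreed perfectly on the clustering itself
```

Over 20,000 runs with randomly permuted labels, the same effect pulls every averaged centroid toward the global data mean, which explains the unused clusters you observed.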

There are methods for computing cluster similarities using co-membership measures, as in cluster_similarity; however, even if you align the label orderings for runs $i$ and $j$, the alignment may differ in subsequent runs. I can't see a straightforward way to average these cluster centers. Typically, the single best run is kept instead.
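Keeping the single best restart can be sketched as follows, using within-cluster sum of squares (inertia) as the selection criterion; the minimal `kmeans` helper and the toy data are assumptions. This mirrors what scikit-learn's `KMeans` does via its `n_init` parameter:

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Minimal Lloyd's k-means; returns (centroids, labels, inertia)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    # inertia: within-cluster sum of squared distances
    inertia = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, inertia

rng = np.random.default_rng(1)
# hypothetical data: 4 well-separated Gaussian blobs
X = np.vstack([rng.normal(c, 0.3, size=(50, 2))
               for c in [(0, 0), (0, 5), (5, 0), (5, 5)]])

# instead of averaging centroids across restarts, keep the restart
# with the lowest inertia
best_centroids, best_labels, best_inertia = min(
    (kmeans(X, 4, rng) for _ in range(20)), key=lambda r: r[2])
print("best inertia:", best_inertia)
print("clusters used:", np.unique(best_labels).size)
```

Label switching is harmless here, because no quantities are ever compared across differently-labelled runs; only one run's centroids survive.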