Clustering – How to Implement K-Means Cluster Analysis Algorithm Correctly?

algorithms, clustering, k-means

I am trying to implement K-means cluster analysis using the standard algorithm.

My implementation seems to work, but I noticed some strange behavior: when k is close to half the length of the list being analyzed, one of the resulting sets is empty. I am not sure whether this is correct behavior.

I think the worst case is k equal to the length of the list, in which case each result set has exactly one element. Empty result sets will occur if k is greater than the length of the list, but that is an invalid situation.

Best Answer

The behavior you describe is correct. Choosing $K$ so large relative to the length of your list is one of the reasons why you get empty clusters. Choose $K$ and your initial set of centroids (which I assume you sampled from your population) with care.
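To make the empty-cluster case concrete, here is a minimal sketch for 1-D data. It is not the asker's implementation: the names `assign` and `kmeans`, the rule that ties go to the lowest-indexed centroid, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for illustration. The list contains duplicate values, so two of the sampled initial centroids can coincide, and one of them never receives a point.

```python
def assign(points, centroids):
    """Assign each 1-D point to its nearest centroid (ties go to the lowest index)."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, initial_centroids, max_iter=100):
    """Standard-style iteration: assign points, then move each centroid to its cluster mean."""
    centroids = list(initial_centroids)
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # Assumption: an empty cluster simply keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Six points, K = 3 (half the list length); two initial centroids coincide.
points = [1, 1, 1, 1, 10, 10]
centroids, clusters = kmeans(points, [1, 1, 10])
print(clusters)  # [[1, 1, 1, 1], [], [10, 10]]
```

The second cluster stays empty because every point at 1 ties between the first two centroids and goes to the first. Other implementations handle this differently, for example by re-seeding the empty cluster's centroid from a random point.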

Remember also that although K-means is an optimization problem, its objective function is not convex, so the algorithm may converge to a local minimum that depends on the initial centroids.
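For reference, the objective usually associated with K-means is the within-cluster sum of squared distances. The symbols here are assumptions, not notation from the question: $C_k$ is the set of points assigned to cluster $k$, and $\mu_k$ is its centroid.

$$J(C_1,\dots,C_K,\mu_1,\dots,\mu_K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

Each step of the standard algorithm does not increase $J$, but because $J$ is not convex when taken jointly over the assignments and the centroids, the point where the iteration stops can be a local rather than a global minimum.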

Last but not least, run K-means several times with the same $K$ (starting from different initial centroids) and compare the results, so you get an idea of how stable the solution is.
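A rough sketch of such a comparison, again for 1-D data, is shown here. The function names, the use of seeded `random.Random(...).sample` to pick initial centroids from the data, and the within-cluster sum of squares as the comparison measure are assumptions chosen for illustration.

```python
import random


def assign(points, centroids):
    """Assign each 1-D point to its nearest centroid (ties go to the lowest index)."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, seed, max_iter=100):
    """One K-means run whose initial centroids are sampled from the data."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # Assumption: an empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


def within_cluster_ss(centroids, clusters):
    """Sum of squared distances from each point to its cluster's centroid."""
    return sum((p - m) ** 2 for m, c in zip(centroids, clusters) for p in c)


def compare_runs(points, k, seeds):
    """Run K-means once per seed and collect (cost, sorted centroids) pairs."""
    results = []
    for seed in seeds:
        centroids, clusters = kmeans(points, k, seed)
        results.append((within_cluster_ss(centroids, clusters), sorted(centroids)))
    return results


points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.3, 8.7]
runs = compare_runs(points, k=3, seeds=range(5))
for cost, centroids in runs:
    print(round(cost, 3), centroids)
```

If the runs report noticeably different costs or centroids, the result depends on the initialization, and it is common to keep the run with the lowest cost.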