Solved – Should one randomize the order in the data, for a k-means cluster analysis

k-means

I have read somewhere that it is better to randomize the order of your data several times and run a k-means analysis after each randomization, to check that your clustering results are consistent (reproducible). In this way, you would be able to find and define clusters that have not arisen by chance.

If that is the case, my questions are:
– Should you randomize the order of rows (samples) or columns (variables)? Or both?
– How many repetitions (that is, one randomization plus its corresponding k-means analysis) would be appropriate?

Best Answer

Order of cases (data points). Two situations come to mind in which changing the order of cases in the dataset can affect the results of k-means clustering. In these situations you had better randomize the order (randomize a few times and, if the resulting clusterings differ considerably, average the final centroids of corresponding clusters across these solutions and enter the averages as the initial centres for one final run). The two situations are:

  • You are using the first (or last) k cases as the initial centres, or some other method of selecting the initial centres that is sensitive to case order.

  • You are using the so-called running means, or a similar special version of k-means, which is an "online clustering" method. Here, centroids (means) are recalculated every time a case is (re)assigned to a cluster, whereas in classic (Lloyd's) k-means all centroids are recalculated once after all cases have been assigned (i.e. once per iteration).
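
Both situations, and the shuffle-and-average remedy, can be illustrated with a minimal pure-Python sketch. Everything named here is an assumption made for illustration, not part of any library: the functions `running_means`, `lloyd` and `align`, the toy 1-D data, and the brute-force matching of centroids across runs (needed because cluster labels are arbitrary). The final run is assumed to be a batch (Lloyd's) run.

```python
import random
from itertools import permutations

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centres):
    # Index of the closest centre; exact ties go to the lowest index.
    return min(range(len(centres)), key=lambda j: sq_dist(p, centres[j]))

def running_means(points, k):
    # Order-sensitive on both counts: the first k cases seed the centres,
    # and each later case moves its nearest centre immediately.
    centres = [tuple(p) for p in points[:k]]
    counts = [1] * k
    for p in points[k:]:
        j = nearest(p, centres)
        counts[j] += 1
        centres[j] = tuple(c + (x - c) / counts[j] for c, x in zip(centres[j], p))
    return centres

def lloyd(points, centres, max_iter=100):
    # Batch k-means: reassign all cases, then recompute every centre once.
    centres = [tuple(c) for c in centres]
    for _ in range(max_iter):
        groups = [[] for _ in centres]
        for p in points:
            groups[nearest(p, centres)].append(p)
        new = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
               for g, c in zip(groups, centres)]
        if new == centres:
            break
        centres = new
    return centres

def align(ref, centres):
    # Reorder centres to match ref; brute force, so only sensible for small k.
    return min(permutations(centres),
               key=lambda perm: sum(sq_dist(r, c) for r, c in zip(ref, perm)))

data = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]  # toy data
k = 2

# Same cases in two different orders give two different solutions.
order_a = data
order_b = [data[0], data[3], data[1], data[2], data[4], data[5]]
run_a = running_means(order_a, k)
run_b = running_means(order_b, k)

# Shuffle a few times, align the solutions, average them, then do one final run.
rng = random.Random(0)
runs = []
for _ in range(5):
    shuffled = data[:]
    rng.shuffle(shuffled)
    runs.append(running_means(shuffled, k))
aligned = [align(runs[0], r) for r in runs]
start = [tuple(sum(col) / len(aligned) for col in zip(*(run[j] for run in aligned)))
         for j in range(k)]
final = lloyd(data, start)
print(run_a, run_b, final)
```

With `order_a` the second centre is dragged across the gap between the two groups, while `order_b` seeds one centre in each group, so the two single-pass solutions disagree.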

Otherwise, k-means is insensitive to case order. (It remains potentially sensitive to the order in which the k initial centres are specified, though, specifically when some cases are exactly equidistant from different centres. You may change the order of the initial centres and see whether the solution remains stable.)
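
A small check of both claims, again as a hedged pure-Python sketch: the `lloyd` function, the toy data and the rule that exact ties go to the lowest-indexed centre are assumptions for illustration (real implementations may break ties differently). The first part shuffles rows while keeping the initial centres fixed; the second part swaps the order of the initial centres when one case is exactly equidistant from both.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centres):
    # Exact ties go to the lowest-indexed centre (an assumed tie rule).
    return min(range(len(centres)), key=lambda j: sq_dist(p, centres[j]))

def lloyd(points, centres, max_iter=100):
    # Batch k-means: reassign all cases, then recompute every centre once.
    centres = [tuple(c) for c in centres]
    for _ in range(max_iter):
        groups = [[] for _ in centres]
        for p in points:
            groups[nearest(p, centres)].append(p)
        new = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
               for g, c in zip(groups, centres)]
        if new == centres:
            break
        centres = new
    return centres

def close(cs1, cs2):
    # Compare two centre lists coordinate by coordinate, allowing for rounding.
    return all(math.isclose(x, y, abs_tol=1e-9)
               for a, b in zip(cs1, cs2) for x, y in zip(a, b))

# Part 1: fixed initial centres, shuffled rows.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
        (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
init = [(0.0, 0.0), (10.0, 10.0)]  # given explicitly, not by row position
baseline = lloyd(data, init)
rng = random.Random(0)
row_order_invariant = True
for _ in range(10):
    shuffled = data[:]
    rng.shuffle(shuffled)
    row_order_invariant = row_order_invariant and close(lloyd(shuffled, init), baseline)

# Part 2: the case 1.0 is exactly equidistant from the centres 0.0 and 2.0.
tied = [(0.0,), (1.0,), (2.0,)]
forward = lloyd(tied, [(0.0,), (2.0,)])   # [(0.5,), (2.0,)]
swapped = lloyd(tied, [(2.0,), (0.0,)])   # [(1.5,), (0.0,)]
print(row_order_invariant, forward, swapped)
```

In the tied example the middle case joins whichever centre is listed first, so the two orders of the same initial centres settle on different partitions.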

K-means is insensitive to the order of variables (features) in the dataset.
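
The reason is that the squared Euclidean distance is a sum over variables, and a sum does not depend on the order of its terms. A short sketch of that check, assuming squared Euclidean distance and using made-up random data and a hypothetical column order:

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def permute(vec, order):
    # Rearrange the variables (columns) of one case or centre.
    return tuple(vec[j] for j in order)

rng = random.Random(0)
points = [tuple(rng.gauss(0, 1) for _ in range(4)) for _ in range(20)]
centres = points[:3]
order = [2, 0, 3, 1]  # hypothetical new column order

# Every case-to-centre distance is unchanged (up to floating-point rounding)
# when cases and centres have their columns permuted in the same way.
same = all(
    math.isclose(sq_dist(p, c),
                 sq_dist(permute(p, order), permute(c, order)),
                 abs_tol=1e-12)
    for p in points for c in centres
)
print(same)
```

Since assignments depend only on these distances and each centroid coordinate is a per-variable mean, the resulting centroids are simply the same centroids with their coordinates permuted.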
