Solved – Initializing K-means centers by the means of random subsamples of the dataset

clusteringk-meansunsupervised learning

If I have a certain dataset, how smart would it be to initialize cluster centers using means of random samples of that dataset?

For example, suppose I want 5 clusters. I take 5 random samples of say, size=20% of the original dataset. Could I then take the mean of each of these 5 random samples and use those means as my 5 initial cluster centers? I don't know where I read this but I wanted to know what you guys think about the idea.


UPDATE: Please see this thread Initializing K-means clustering: what are the existing methods? for the general discussion about the various initialization methods.

Best Answer

If you randomly split the sample into 5 subsamples your 5 means will almost coincide. What is the sense of making such close points the initial cluster centres?

In many K-means implementations, the default selection of initial cluster centres is based on the opposite idea: to find the 5 points which are most far apart and make them the initial centres. You may ask what may be the way to find those far apart points? Here's what SPSS' K-means is doing for that:

Take any k cases (points) of the dataset as the initial centres. All the rest cases are being checked for the ability to substitute those as the initial centres, by the following conditions:

  • a) If the case is farther from the centre closest to it than the distance between two most close to each other centres, the case substitutes that centre of the latter two to which it is closer.
  • b) If the case is farther from the centre 2nd closest to it than the distance between the centre closest to it and the centre closest to this latter one, the case substitutes the centre closest to it.

If condition (a) is not satisfied, condition (b) is checked; if it is not satisfied either the case does not become a centre. As the result of such run through cases we obtain k utmost cases in the cloud which become the initial centres. The result of this algo, although robust enough, is not fully insensitive to the starting choice of "any k cases" and to the sort order of cases in the dataset; so, several random starting attempts are still welcome, as it is always the case with K-means.

See my answer with a list of popular initializing methods for k-means. Method of splitting into random subsamples (critisized here by me and others) as well as the described method used by SPSS - are on the list too.