If I have a certain dataset, how smart would it be to initialize cluster centers using means of random samples of that dataset?

For example, suppose I want `5 clusters`. I take `5 random samples` of, say, `size=20%` of the original dataset. Could I then take the mean of each of these 5 random samples and use those means as my 5 initial cluster centers? I don't know where I read this, but I wanted to know what you guys think about the idea.
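The proposed initialization can be sketched in a few lines. This is a minimal illustration on synthetic data (the dataset, sizes, and random seed are made up for the example); it computes the mean of each of 5 random 20% subsamples and compares the spread of those means to the spread of the data itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 1000 points in 2-D (an assumption for illustration).
X = rng.normal(loc=0.0, scale=5.0, size=(1000, 2))

k = 5
sample_frac = 0.20  # each random subsample is 20% of the dataset
n_sample = int(sample_frac * len(X))

# Proposed initialization: the mean of each of k random subsamples.
centers = np.array([
    X[rng.choice(len(X), size=n_sample, replace=False)].mean(axis=0)
    for _ in range(k)
])

# Compare how spread out these candidate centres are
# versus how spread out the data itself is.
spread_of_centers = centers.std(axis=0).mean()
spread_of_data = X.std(axis=0).mean()
print(spread_of_centers, spread_of_data)
```

On data like this, each subsample mean lands close to the grand mean of the dataset, so the candidate centres are far more tightly bunched than the data they are supposed to partition.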

**UPDATE:** Please see the thread *Initializing K-means clustering: what are the existing methods?* for a general discussion of the various initialization methods.

## Best Answer

If you randomly split the sample into 5 subsamples, your 5 means will almost coincide. What is the sense of making such close points the initial cluster centres?

In many K-means implementations, the default selection of initial cluster centres is based on the opposite idea: find the 5 points that are farthest apart and make them the initial centres. How can one find such far-apart points? Here is what SPSS's K-means does for that:

Take any *k* cases (points) of the dataset as the initial centres. All the remaining cases are then checked for their ability to replace one of those centres, according to two conditions. If condition (a) is not satisfied, condition (b) is checked; if neither is satisfied, the case does not become a centre. As the result of such a run through the cases, we obtain the *k* utmost cases in the cloud, which become the initial centres. The result of this algorithm, although robust enough, is not fully insensitive to the starting choice of "any *k* cases" or to the sort order of the cases in the dataset; so several random starting attempts are still welcome, as is always the case with K-means.

See my answer with a list of popular initialization methods for k-means. The method of splitting into random subsamples (criticized here by me and others), as well as the described method used by SPSS, are on that list too.
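This is not the SPSS algorithm itself, but the "far apart" idea it embodies can be sketched with a simple greedy farthest-point rule: pick one case at random, then repeatedly add the case farthest from all centres chosen so far. The data and seed below are invented for illustration:

```python
import numpy as np

def farthest_point_init(X, k, rng):
    """Greedy farthest-point initialization: start from a random case,
    then repeatedly add the case farthest from its nearest chosen centre."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # distance from every case to its nearest already-chosen centre
        d = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2),
            axis=1,
        )
        centers.append(X[np.argmax(d)])
    return np.array(centers)

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs (an assumption for the example).
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(100, 2))
    for c in [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
])
centers = farthest_point_init(X, 3, rng)
print(centers)
```

Unlike the subsample-means idea, this yields centres in distinct regions of the cloud; and like the SPSS procedure described above, it is still sensitive to the random starting case, so several random restarts remain advisable.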