If I have a certain dataset, how smart would it be to initialize cluster centers using means of random samples of that dataset?
For example, suppose I want 5 clusters. I take 5 random samples of, say, 20% of the original dataset's size. Could I then take the mean of each of these 5 random samples and use those means as my 5 initial cluster centers? I don't know where I read this, but I wanted to know what you guys think about the idea.
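To make the idea concrete, here is a minimal sketch of what I have in mind, using scikit-learn's KMeans on made-up data (the Gaussian data and the 20% figure are just placeholders, not part of the claim):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))    # placeholder for the real dataset

k, frac = 5, 0.20
n_sub = int(frac * len(X))

# Mean of each of k random subsamples -> one initial centre per subsample.
centres = np.array([X[rng.choice(len(X), size=n_sub, replace=False)].mean(axis=0)
                    for _ in range(k)])

# Feed the subsample means in as explicit initial centres.
km = KMeans(n_clusters=k, init=centres, n_init=1).fit(X)
```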
UPDATE: Please see the thread Initializing K-means clustering: what are the existing methods? for a general discussion of the various initialization methods.
Best Answer
If you randomly split the sample into 5 subsamples, your 5 means will almost coincide: each subsample mean is just an estimate of the grand mean of the whole dataset, so all 5 of them land near that same point. What is the sense of making such close points the initial cluster centres?
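A quick numerical illustration of this (toy Gaussian data, but the effect holds for any dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))          # toy stand-in for "a certain dataset"

# Means of 5 random 20% subsamples, as proposed in the question.
means = np.array([X[rng.choice(len(X), size=200, replace=False)].mean(axis=0)
                  for _ in range(5)])

# The 5 means huddle around the grand mean of X: their spread shrinks
# like std(X) / sqrt(subsample size), far smaller than the data's spread.
print(np.ptp(means, axis=0))   # tiny compared to...
print(np.ptp(X, axis=0))       # ...the range of the data itself
```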
In many K-means implementations, the default selection of initial cluster centres is based on the opposite idea: find the 5 points that are farthest apart from each other and make them the initial centres. How can such far-apart points be found? Here is what SPSS' K-means does for that:
Take any k cases (points) of the dataset as the initial centres. Every remaining case is then checked for its ability to replace one of the current centres, using the following conditions:

(a) If the case is farther from the centre nearest to it than the two closest centres are from each other, the case replaces whichever of those two centres it is nearer to.

(b) If the case is farther from the centre second-nearest to it than that nearest centre is from its own closest fellow centre, the case replaces the centre nearest to it.

Condition (a) is checked first; if it is not satisfied, condition (b) is checked; if neither is satisfied, the case does not become a centre. As a result of this pass through the cases we obtain k outlying cases in the data cloud, which become the initial centres. The result of this algorithm, although robust enough, is not fully insensitive to the starting choice of "any k cases" or to the sort order of cases in the dataset, so several random starting attempts are still welcome, as is always the case with K-means.
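A rough NumPy sketch of this selection pass, as I read it (this is my paraphrase, not SPSS's actual code; the function name and the brute-force pairwise distances are mine):

```python
import numpy as np

def far_apart_init(X, k):
    """Pick k far-apart cases as initial centres (sketch of the
    SPSS-style pass described above; far_apart_init is a made-up name)."""
    centres = X[:k].astype(float).copy()   # start with "any k cases"
    for x in X[k:]:
        d = np.linalg.norm(centres - x, axis=1)   # case-to-centre distances
        i, j = np.argsort(d)[:2]                  # nearest, 2nd-nearest centre
        pair = np.linalg.norm(centres[:, None] - centres[None, :], axis=2)
        np.fill_diagonal(pair, np.inf)
        p, q = np.unravel_index(np.argmin(pair), pair.shape)  # two closest centres
        if d[i] > pair[p, q]:
            # (a): replace whichever of the two closest centres is nearer to x
            if np.linalg.norm(x - centres[p]) <= np.linalg.norm(x - centres[q]):
                centres[p] = x
            else:
                centres[q] = x
        elif d[j] > pair[i].min():
            # (b): replace the centre nearest to x
            centres[i] = x
    return centres
```

The returned array can then be passed as explicit initial centres, e.g. to scikit-learn via `KMeans(n_clusters=k, init=far_apart_init(X, k), n_init=1)`.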
See my answer with a list of popular initialization methods for k-means. The method of splitting into random subsamples (criticized here by me and others), as well as the SPSS method described above, are on that list too.