My question is probably elementary, and I apologize for that. I am reading Kogan's "Introduction to Clustering Large and High-Dimensional Data"; I am interested in understanding batch K-means and K-means and use it in $\operatorname{R}$. In the textbook it is stated that both algorithms need an initial choice of
- the number of clusters $K$
- An initial partition of the given dataset
Using such entries, the algorithms can perform the learning exercise. Kogan states that the initial partition is usually found using a Principal Direction Divisive Partitioning (PDDP) algorithm.
Looking at the K-means function kmeans
in $\operatorname{R}$ I have noticed the absence of the initial partition as argument of the function itself. One can specify the number of clusters or a set of initial centers.
Moreover, the default K-mean algorithm used by kmeans
is the one by Hartigan and Wong (1979). Unfortunately I have no access to the original paper, and I could not run through the original code, searching for the initial partition.
My questions are:
- is there an initial partition choice hidden somewhere in
kmeans
? If yes, how is it chosen? - In absence of initial partition choice, how does
kmeans
begins to run (a high level overview would be great!)?
I thank you all.
Best Answer
Compute the centers of the predefined partitions, and use these to run k-means.
Try to find a library that has the book by Hartigan, his variant should be explained there.
A good article discussing the merits of his variant is this:
Telgarsky, Vattani
And always remember that K-means is just a partitioning heuristic. It does not actually look for structure, and there are tons of toy examples where it just fails badly. Consider it a preprocessing method. In particular, when the Euclidean distance isn't what you really want to use on your data.
Being a good heuristic means it will often give reasonable good results; in particular when used in certain ways. Even when the results (e.g. due to outliers) may appear quite bad in a strict analysis of the clustering result.