Solved – the initial partition for k-means in R

Tags: clustering, k-means, machine learning, r

My question is probably elementary, and I apologize for that. I am reading Kogan's "Introduction to Clustering Large and High-Dimensional Data"; I am interested in understanding batch K-means and K-means, and in using them in R. In the textbook it is stated that both algorithms need an initial choice of

  1. the number of clusters $K$
  2. An initial partition of the given dataset

Given these inputs, the algorithms can run. Kogan states that the initial partition is usually found using a Principal Direction Divisive Partitioning (PDDP) algorithm.

Looking at R's K-means function, kmeans, I noticed that an initial partition is absent from its arguments: one can specify either the number of clusters or a set of initial centers.
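To illustrate the two ways of starting kmeans that its interface does support (the data below are made up for the example):

```r
set.seed(42)
# Two well-separated Gaussian blobs of 50 points each, in 2D.
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# Option 1: give only the number of clusters K;
# kmeans then picks K distinct rows of x at random as initial centers.
fit1 <- kmeans(x, centers = 2)

# Option 2: give an explicit matrix of initial centers,
# one row per cluster; no random initialization happens.
init <- rbind(c(0, 0), c(4, 4))
fit2 <- kmeans(x, centers = init)
```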

Moreover, the default K-means algorithm used by kmeans is the one by Hartigan and Wong (1979). Unfortunately I have no access to the original paper, and I could not search the original code for the initial partition.

My questions are:

  • is there an initial partition choice hidden somewhere in kmeans? If yes, how is it chosen?
  • In the absence of an initial partition, how does kmeans begin to run (a high-level overview would be great!)?

I thank you all.

Best Answer

If you already have an initial partition, compute the center of each of its clusters and pass these to kmeans as the centers argument.
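A minimal sketch of that idea in R; the partition labels here are just an arbitrary split for illustration (in practice they would come from something like PDDP):

```r
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)  # 100 points in 2D

# A predefined partition: a label in {1, 2} for each row of x.
partition <- rep(1:2, each = 50)

# Centroid of each block of the partition: for every coordinate,
# average the values within each cluster label.
init_centers <- apply(x, 2, function(col) tapply(col, partition, mean))
# init_centers is a 2 x 2 matrix, one row per cluster.

# Run k-means starting from exactly those centers.
fit <- kmeans(x, centers = init_centers)
```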

Try to find a library that has the book by Hartigan; his variant should be explained there.

A good article discussing the merits of his variant is this:

  • Hartigan’s Method: k-means Clustering without Voronoi
    Telgarsky, Vattani

And always remember that K-means is just a partitioning heuristic. It does not actually look for structure, and there are tons of toy examples where it simply fails badly. Consider it a preprocessing method, in particular when Euclidean distance isn't what you really want to use on your data.

Being a good heuristic means it will often give reasonably good results, in particular when used in certain ways, even when the results (e.g. due to outliers) may appear quite bad under a strict analysis of the clustering.