Solved – the initial partition for k-means in R

Tags: clustering, k-means, machine learning, r

My question is probably elementary, and I apologize for that. I am reading Kogan's "Introduction to Clustering Large and High-Dimensional Data"; I am interested in understanding batch K-means and K-means, and in using them in R. In the textbook it is stated that both algorithms need an initial choice of

  1. the number of clusters $K$
  2. An initial partition of the given dataset

Given these inputs, the algorithms can run. Kogan states that the initial partition is usually found using a Principal Direction Divisive Partitioning (PDDP) algorithm.

Looking at R's K-means function, kmeans, I noticed that an initial partition is absent from its arguments: one can specify either the number of clusters or a set of initial centers.
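To illustrate the two ways of starting kmeans that its interface does support (the data below are made up for the example):

```r
set.seed(42)
# Two well-separated Gaussian blobs of 50 points each, in 2D.
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# Option 1: give only the number of clusters K;
# kmeans then picks K distinct rows of x at random as initial centers.
fit1 <- kmeans(x, centers = 2)

# Option 2: give an explicit matrix of initial centers,
# one row per cluster; no random initialization happens.
init <- rbind(c(0, 0), c(4, 4))
fit2 <- kmeans(x, centers = init)
```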

Moreover, the default K-means algorithm used by kmeans is the one by Hartigan and Wong (1979). Unfortunately I have no access to the original paper, and I could not search the original code for the initial partition.

My questions are:

  • is there an initial partition choice hidden somewhere in kmeans? If yes, how is it chosen?
  • In the absence of an initial partition, how does kmeans begin to run (a high-level overview would be great!)?

I thank you all.

Best Answer

If you already have an initial partition, compute the center of each of its clusters and pass these to kmeans as the centers argument.
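A minimal sketch of that idea in R; the partition labels here are just an arbitrary split for illustration (in practice they would come from something like PDDP):

```r
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)  # 100 points in 2D

# A predefined partition: a label in {1, 2} for each row of x.
partition <- rep(1:2, each = 50)

# Centroid of each block of the partition: for every coordinate,
# average the values within each cluster label.
init_centers <- apply(x, 2, function(col) tapply(col, partition, mean))
# init_centers is a 2 x 2 matrix, one row per cluster.

# Run k-means starting from exactly those centers.
fit <- kmeans(x, centers = init_centers)
```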

Try to find a library that has the book by Hartigan; his variant should be explained there.

A good article discussing the merits of his variant is this:

  • Hartigan’s Method: k-means Clustering without Voronoi
    Telgarsky, Vattani

And always remember that K-means is just a partitioning heuristic. It does not actually look for structure, and there are tons of toy examples where it simply fails badly. Consider it a preprocessing method, in particular when Euclidean distance isn't what you really want to use on your data.

Being a good heuristic means it will often give reasonably good results, in particular when used in certain ways, even when the results (e.g. due to outliers) may appear quite bad under a strict analysis of the clustering.