Solved – Methods of initializing K-means clustering

Tags: clustering, k-means

I am interested in the current state of the art for selecting initial seeds (cluster centers) for K-means.

Googling leads to two popular choices:

  1. random selection of initial seeds, and,
  2. using the k-means++ selection technique: Arthur & Vassilvitskii (2007), k-means++: The Advantages of Careful Seeding

Are there any other promising methods that anyone here is aware of, which might not be as popular?

Best Answer

Allow me simply to copy-paste a list of options from my own function !kmini (a macro for SPSS), found in the collection "Clustering" here.

Methods to create or select the initial cluster centres. Choose:

  • RGC - centroids of random subsamples. The data are partitioned randomly into k nonoverlapping (by membership) groups, and the centroids of these groups are appointed as the initial centres. Thus, the centres are computed, not selected from the existing dataset cases. This method yields centres that lie close to each other and to the general centroid of the data.
  • RP - randomly selected points. k distinct cases of the data are randomly selected to be the initial centres.
  • RUNFP - farthest points (running selection). The first k cases are taken as centres, and then, during a run through the rest of the cases of the dataset, replacements among the centres are made progressively; the aim of the replacements is to end up with k points that are maximally distant from each other in the variable space. These points (cases), which occupy peripheral positions in the data cloud, are the produced initial centres. (This method is the default in the SPSS k-means procedure QUICK CLUSTER. See details in SPSS Algorithms; see also the description here.)
  • SIMFP - farthest points (simple selection). The first centre is selected as a random case from the dataset. The 2nd centre is selected as the case maximally distant from that centre. The 3rd centre is selected as the case maximally distant from those two (from the nearer of the two), and so on.
  • KMPP - random farthest points, or k-means++. The first centre is selected as a random case from the dataset. The 2nd centre is also selected randomly, but the probability of selecting a case is proportional to its distance (squared Euclidean) to that (1st) centre. The 3rd centre is also selected randomly, with the probability of selection proportional to a case's distance to the nearer of those two centres, and so on. (A sketch of this and the other random seeding methods appears after this list.) (Arthur, D., & Vassilvitskii, S. k-means++: the advantages of careful seeding. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, 1027–1035.)
  • GREP - group representative points. The idea of the method is to collect as centres the k most representative, “deputy” cases. The 1st centre is taken as the case closest to the general data centroid. Then each of the remaining centres is selected from the data points by considering, for each point, whether it is closer (and by how much, in terms of squared Euclidean distance) to a group of points than each of those points is to its nearest already-existing centre. That is, each point is examined as a candidate to represent some group of points not yet well enough represented by the centres already collected. The point most representative in this respect is selected as the next centre (see the sketch further below). (Kaufman, L., & Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis. 1990. See also: Pena, J.M., et al. An empirical comparison of four initialization methods for the K-means algorithm. Pattern Recognition Letters 20(10), 1999, 1027–1040.)
  • [There is also a nice method, not yet implemented by me in the macro, to generate k points which are drawn from a random uniform distribution but are "less random than random", somewhere between random and greedy; see a potential theoretical basis for that method.]
  • One more method is to do hierarchical clustering by Ward's method (you may do it on a subsample of objects if the sample is too big) and take the means of the k clusters it produces as the initial seeds for the k-means procedure (a sketch follows below). Ward's method is preferable over other hierarchical clustering methods because it shares its objective, minimizing the within-cluster sum of squares, with k-means.
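To make the random seeding options above concrete, here is a minimal NumPy sketch of RGC, RP, SIMFP, and KMPP as described; the function names and signatures are mine for illustration and are not part of the !kmini macro:

```python
import numpy as np

def rgc(X, k, rng):
    """RGC: centroids of a random partition of the data into k groups."""
    labels = rng.permutation(len(X)) % k  # random nonoverlapping membership
    return np.stack([X[labels == j].mean(axis=0) for j in range(k)])

def rp(X, k, rng):
    """RP: k distinct cases selected at random."""
    return X[rng.choice(len(X), size=k, replace=False)]

def simfp(X, k, rng):
    """SIMFP: first centre random; each next centre is the case farthest
    (squared Euclidean) from its nearest already-chosen centre."""
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d)])
    return np.stack(centres)

def kmpp(X, k, rng):
    """KMPP (k-means++): like SIMFP, but each next centre is drawn at random
    with probability proportional to squared distance to the nearest centre."""
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[rng.choice(len(X), p=d / d.sum())])
    return np.stack(centres)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
seeds = kmpp(X, 4, rng)  # -> array of shape (4, 2)
```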

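For the Ward-based seeding in the last item, a sketch using SciPy's hierarchical clustering; the wrapper ward_init is again my own naming, not part of the macro:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ward_init(X, k, sample_size=None, rng=None):
    """Seed k-means with the centroids of k Ward clusters, optionally
    computed on a random subsample if the dataset is large."""
    rng = rng or np.random.default_rng()
    S = X if sample_size is None else X[rng.choice(len(X), sample_size, replace=False)]
    labels = fcluster(linkage(S, method="ward"), t=k, criterion="maxclust")
    return np.stack([S[labels == j].mean(axis=0) for j in range(1, k + 1)])
```

The resulting seed array (like any of the arrays above) can then be fed to a k-means routine, e.g. scikit-learn's KMeans(n_clusters=k, init=seeds, n_init=1).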
Methods RGC, RP, SIMFP, and KMPP depend on random numbers and may produce different results from run to run.

Method RUNFP may be sensitive to case order in the dataset; method GREP is not (apart from occasions when there are many identical cases, i.e. ties, in the data). Method GREP may fail to collect all k centres if k is large relative to the number of cases in the data (n), especially when k > n/2. [The macro will inform you if the data do not allow that method to collect k centres.] Method GREP is the slowest one: in my implementation it computes the matrix of distances between all cases, so it will not suit data with many tens of thousands or millions of cases. You could, however, run it on a random subsample of the data.
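For reference, here is my reading of the GREP rule (Kaufman & Rousseeuw's representative-points selection) in NumPy. It builds the full pairwise squared-distance matrix, which is exactly the memory cost mentioned above, and the implementation in the macro may differ in detail:

```python
import numpy as np

def grep_init(X, k):
    """GREP-style seeding: each new centre is the candidate case that best
    'represents' the cases still poorly covered by the existing centres.
    Builds the full n x n squared-distance matrix, hence O(n^2) memory."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    centroid = X.mean(axis=0)
    seeds = [int(np.argmin(((X - centroid) ** 2).sum(axis=1)))]  # case nearest the general centroid
    for _ in range(k - 1):
        nearest = D2[:, seeds].min(axis=1)  # each case's squared distance to its nearest seed
        gain = np.maximum(nearest[None, :] - D2, 0.0).sum(axis=1)  # coverage a candidate would add
        gain[seeds] = -np.inf  # never pick the same case twice
        seeds.append(int(np.argmax(gain)))
    return X[seeds]
```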

I'm not presently discussing which method is "better" and in what circumstances, because I haven't done extensive simulation probing of the question so far. My very preliminary and superficial impression is that GREP is particularly worthy (but expensive), and that if you want a really cheap method that is still competitive enough, then just k random points, RP, is a decent choice.
