Algorithms like Lloyd's can be implemented with only $k\cdot(2\cdot d + 1)$ floating-point values of memory: $k\cdot d$ for the current centres, $k\cdot d$ for the running sums of the next centres, and $k$ counts. MacQueen's k-means algorithm needs only $k\cdot(d + 1)$ values, because it updates each centre incrementally as every point is processed.
However, as most users will want to know which point belongs to which cluster, almost every implementation you'll find uses $O(n + k\cdot d)$ memory.
In other words, the memory use of k-means is essentially the size of the output data.
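To make the $k\cdot(d+1)$ figure concrete, here is a minimal Python sketch of MacQueen-style online updating (the function name and streaming setup are my own illustration); note that no per-point labels are kept, which is exactly what the $O(n)$ term above pays for:

```python
import numpy as np

def macqueen_kmeans(stream, k, d):
    """Online k-means in the spirit of MacQueen (1967).

    Memory: k*d floats for the centres plus k counts,
    i.e. k*(d+1) values, independent of the number of points n.
    """
    centres = np.empty((k, d))
    counts = np.zeros(k, dtype=np.int64)

    for i, x in enumerate(stream):
        if i < k:                       # seed with the first k points
            centres[i], counts[i] = x, 1
            continue
        j = np.argmin(((centres - x) ** 2).sum(axis=1))  # nearest centre
        counts[j] += 1
        centres[j] += (x - centres[j]) / counts[j]       # incremental mean
    return centres, counts
```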
If you randomly split the sample into 5 subsamples, your 5 means will almost coincide. What is the sense of making such close points the initial cluster centres?
In many K-means implementations, the default selection of initial cluster centres is based on the opposite idea: to find the 5 points which are farthest apart and make them the initial centres. You may ask how those far-apart points can be found. Here is what SPSS' K-means does for that:
Take any k cases (points) of the dataset as the initial centres. Every remaining case is then checked for its ability to replace one of the current centres, under the following conditions:
- a) If the case is farther from the centre closest to it than the distance between the two centres closest to each other, the case replaces whichever centre of the latter two it is closer to.
- b) If the case is farther from the centre 2nd-closest to it than the distance between the centre closest to it and the centre closest to that latter one, the case replaces the centre closest to it.
If condition (a) is not satisfied, condition (b) is checked; if neither is satisfied, the case does not become a centre. As a result of such a run through the cases we obtain k peripheral cases in the cloud, which become the initial centres. The result of this algorithm, although robust enough, is not fully insensitive to the starting choice of "any k cases" or to the sort order of cases in the dataset; so several random starting attempts are still welcome, as is always the case with K-means.
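To illustrate, here is a rough Python sketch of that single replacement pass, following conditions (a) and (b) above (this is my reading of the description, with my own function name; SPSS' exact tie-breaking details are not reproduced):

```python
import numpy as np

def spss_style_initial_centres(X, k):
    """One pass through the cases; the first k cases seed the centres.

    X : (n, d) array of cases. Returns the k initial centres.
    """
    centres = X[:k].astype(float)

    for x in X[k:]:
        d_xc = np.sqrt(((centres - x) ** 2).sum(axis=1))   # case-to-centre
        c1, c2 = np.argsort(d_xc)[:2]                      # closest, 2nd closest

        d_cc = np.sqrt(((centres[:, None] - centres[None]) ** 2).sum(axis=2))
        np.fill_diagonal(d_cc, np.inf)                     # centre-to-centre

        if d_xc[c1] > d_cc.min():
            # (a) x is farther from its closest centre than the two mutually
            # closest centres are from each other: x replaces whichever of
            # that pair it is closer to.
            i, j = np.unravel_index(np.argmin(d_cc), d_cc.shape)
            centres[i if d_xc[i] < d_xc[j] else j] = x
        elif d_xc[c2] > d_cc[c1].min():
            # (b) x is farther from its 2nd-closest centre than its closest
            # centre is from that centre's own nearest centre: x replaces
            # the centre closest to it.
            centres[c1] = x
    return centres
```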
See my answer with a list of popular initialization methods for k-means. The method of splitting into random subsamples (criticized here by me and others), as well as the described method used by SPSS, are on that list too.
Best Answer
Allow me, without going far, simply to copy-paste a list of options from my own function !kmini (a macro for SPSS), found in the collection "Clustering" here. Method to create or select initial cluster centres. Choose:
- RGC (centroids of random subsamples). The data are partitioned randomly into k nonoverlapping, by membership, groups, and the centroids of these groups are appointed to be the initial centres. Thus, centres are calculated, not selected from the existent dataset cases. This method yields centres that lie close to each other and to the general centroid of the data.
- RP (randomly selected points). k distinct cases of the data are randomly selected to be the initial centres.
- RUNFP (farthest points, running selection). The first k cases are taken as centres, and then, during the run through the rest of the cases of the dataset, replacements among the centres are progressively made; the aim of the replacements is to obtain, in the end, the k points most distant from each other in the variable space. These points (cases), occupying peripheral positions in the data cloud, are the produced initial centres. (The method is used as the default in the SPSS k-means procedure QUICK CLUSTER. See details in SPSS Algorithms; see also the description earlier in this thread.)
- SIMFP (farthest points, simple selection). The 1st centre is selected as a random case of the dataset; each subsequent centre is the case maximally distant from the nearest of the centres already selected.
- GREP (group representative points). k most representative, “deputy” cases. The 1st centre is taken as the case closest to the general data centroid. Then each further centre is selected from the data points in such a way that every point is considered as to whether it is closer (and by how much, in terms of squared euclidean distance) to a set of points than each of the latter is to any of the already existing centres. I.e. every point is examined as a candidate to represent some group of points not yet well enough represented by the centres already collected. The point most representative in this respect is selected as the next centre. (Kaufman, L., Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis, 1990. See also: Pena, J.M. et al. An empirical comparison of four initialization methods for the K-means algorithm // Pattern Recognition Letters 20 (10), 1999, 1027-1040.)
- KMPP (random farthest points, or k-means++). The 1st centre is a random case; each next centre is also selected randomly, but with the probability of selecting a case proportional to its squared euclidean distance to the nearest centre already chosen. The resulting points are thus drawn from a distribution that is "less random than random", somewhere between random and greedy; see Arthur & Vassilvitskii (2007) for the potential theoretical basis of the method. (A minimal sketch of this selection appears right after this list.)
- Ward (hierarchical preliminary clustering). Hierarchical clustering by Ward's method is run first, and the k clusters produced by it are the initial seeds for the k-means procedure. Ward's is preferable over other hierarchical clustering methods because it shares its target objective with k-means.
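As promised, a minimal Python sketch of the KMPP (k-means++) selection; the function name is mine, and squared euclidean distances are assumed, as in Arthur & Vassilvitskii (2007):

```python
import numpy as np

def kmpp_centres(X, k, seed=None):
    """k-means++ seeding: the 1st centre is uniform at random; each next
    centre is drawn with probability proportional to the squared euclidean
    distance to the nearest centre chosen so far."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centres = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # squared distance of every case to its nearest chosen centre
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centres)
```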
Methods RGC, RP, SIMFP, and KMPP depend on random numbers and may change their result from run to run. Method RUNFP may be sensitive to case order in the dataset; method GREP is not (apart from occasions when there are many identical cases, i.e. ties, in the data). Method GREP may fail to collect all k centres if k is large relative to the number of cases in the data (n), especially when k > n/2. [The macro will inform you if the data do not allow that method to collect k centres.] Method GREP is also the slowest: it computes [in my implementation] the matrix of distances between all cases, so it won't suit data with many tens of thousands or millions of cases. You could, however, run it on a random subsample of the data.

I'm not discussing at present which method is "better" and in what circumstances, because I haven't done an extensive simulation study of the question so far. My very preliminary and superficial impression has been that GREP is particularly worthy (but expensive), and that if you want a really cheap method that is still competitive enough, then just k random points, RP, is a decent choice.
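For readers who want to experiment with GREP outside SPSS, here is a rough Python sketch of the selection as described above (my reconstruction in the spirit of Kaufman & Rousseeuw, not the macro's code). The full n-by-n distance matrix is what makes the method expensive:

```python
import numpy as np

def grep_centres(X, k):
    """Group-representative seeding: the 1st centre is the case nearest
    the grand centroid; each next centre is the case that best represents
    points still poorly covered by the chosen centres. O(n^2) memory/time."""
    # full squared-euclidean distance matrix -- the expensive part
    d2 = ((X[:, None] - X[None]) ** 2).sum(axis=2)

    first = int(np.argmin(((X - X.mean(axis=0)) ** 2).sum(axis=1)))
    chosen = [first]
    for _ in range(k - 1):
        # distance of every case to its nearest already-chosen centre
        d_near = d2[:, chosen].min(axis=1)
        # gain of candidate i: how much closer the other points j would be
        # to i than to their current nearest centre (improvements only)
        gain = np.maximum(d_near[None, :] - d2, 0.0).sum(axis=1)
        gain[chosen] = -1.0                   # never re-pick a centre
        chosen.append(int(np.argmax(gain)))
    return X[chosen]
```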