K-Means Clustering – What Are the Assumptions of K-Means Algorithm?

assumptions, clustering, hypothesis testing, k-means, references

I'm trying to understand the assumptions/hypotheses underlying the k-means clustering algorithm; specifically, I'm looking for a research/academic paper listing such hypotheses and explaining the rationale behind them.

Browsing on the internet, I found the following ones:

  1. spherical clusters: the data within each cluster are arranged spherically around the center of the cluster;
  2. cluster size: the clusters identified in the sample are of similar size;

Can you provide some references about this?

Moreover, it would be nice to also have the methodological steps to test for such assumptions.

Thanks all in advance!!

Best Answer

This is a complicated question, as I believe that the role of model assumptions in statistics is generally widely misunderstood, and the situation for k-means is even less clear than for many other methods.

Generally, having a "model assumption" means that there exists a theoretical result showing that a method does, in some sense, a good or even optimal job if the model assumption is in fact fulfilled. However, model assumptions are never precisely fulfilled in real data, so it doesn't make sense to say that "model assumptions have to be fulfilled". It is more important to understand what happens if they are not fulfilled, and this pretty much always depends on how exactly they are not fulfilled.

Some statements regarding k-means:

  1. k-means can be derived as a maximum likelihood estimator under a certain model for clusters that are normally distributed with a spherical covariance matrix, the same for all clusters.

Bock, H. H. (1996) Probabilistic models in cluster analysis. Computational Statistics & Data Analysis, 23, 5–28. URL: http://www.sciencedirect.com/science/article/pii/0167947396889195
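
To see where the "spherical, equal covariance" model in item 1 comes from, here is a rough sketch in my own notation (not exactly Bock's): if each cluster $C_k$ is modelled as $\mathcal{N}(\mu_k, \sigma^2 I)$ with the same $\sigma^2$ for all clusters, the classification log-likelihood over hard assignments is

$$\ell(\mu_1,\dots,\mu_K;\, C_1,\dots,C_K) = -\frac{1}{2\sigma^2}\sum_{k=1}^{K}\sum_{x_i \in C_k}\|x_i-\mu_k\|^2 + \text{const},$$

where the constant does not depend on the assignments or the means, so maximising it is the same as minimising the k-means within-cluster sum of squares.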

  2. Pollard has shown a general consistency result for k-means in a nonparametric setup, meaning that for pretty much all distributions (existence of second moments is assumed) k-means is a consistent estimator of its own canonical functional (see below).

Pollard, D. (1981) Strong consistency of k-means clustering. Annals of Statistics, 9, 135–140. URL: https://doi.org/10.1214/aos/1176345339.

  3. Clustering can generally be interpreted as "constructive", meaning that a clustering method can be seen not in the first place as recovering "true" underlying clusters following a certain model, but rather as constructing a grouping that satisfies certain criteria defined by the clustering method itself. As such, the k-means objective function, minimising the objects' squared Euclidean distances to the centroid of the cluster they are assigned to, defines its own concept of what a cluster actually is, and will give you the corresponding clusters whatever the underlying distribution is. This can be generalised to a definition of a functional defining k-means type clusters for any underlying distribution, which is what Pollard's theory is about. The important question here is whether this definition of clusters is what is relevant and useful to the user in a given application. This depends on the specifics of the situation, and particularly not only on the data or the data generating process, but also on the aim of the clustering and how the clusters are meant to be used. (Similar statements can, by the way, be made about least squares regression and many other statistical methods.)
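
To make the objective function and the "canonical functional" explicit, here is a sketch in my own notation (not exactly Pollard's): for data $x_1,\dots,x_n$, k-means minimises, over centre sets $\{m_1,\dots,m_K\}$,

$$\sum_{i=1}^{n}\min_{1\le k\le K}\|x_i - m_k\|^2,$$

and its population version, for a distribution $P$, is

$$\Phi_K(P) = \min_{\{m_1,\dots,m_K\}} \int \min_{1\le k\le K}\|x - m_k\|^2 \, dP(x).$$

Pollard's result says, roughly, that the centres minimising the empirical criterion converge to the centres minimising the population criterion, whatever $P$ is (given finite second moments and a uniqueness condition on the optimum).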

  4. As far as I know, there is no theoretical result about k-means stating that it requires similar cluster sizes (at least not if "size" refers to the number of points; the "same covariance matrix" assumption of item 1 translates as "same spread in data space").

  5. What is important is to understand what kind of clusters k-means tends to produce. And here item 1 in fact enters again. One can say several things:

(a) k-means is based on a variance criterion that treats all variables and clusters in the same manner, meaning that within-cluster variation tends to be the same for all clusters and all variables (the latter means "spherical").

(b) Comparing the k-means objective function with objective functions based on corresponding mixture models shows that k-means in comparison favours similar cluster sizes, although it doesn't enforce them.

(c) k-means strongly avoids large distances within clusters. This particularly means that if the data has groups with strong separation, k-means will find them (provided k is specified correctly), even if they are not spherical and/or have strongly different numbers of points. What this also means is that clusters will tend to be compact, i.e., not have large within-cluster distances, even if there are connected subsets in the data that spread widely and do have such large distances within them.

(d) One additional way to understand k-means is that it provides a Voronoi tessellation of the data space, generated by the cluster centroids.
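
A minimal sketch of points (c) and (d), assuming NumPy and scikit-learn are available (the data and parameter choices are illustrative, not prescriptive):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two strongly separated groups with very different numbers of points.
small = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))    # 30 points
large = rng.normal(loc=[10.0, 0.0], scale=1.5, size=(300, 2))  # 300 points
X = np.vstack([small, large])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))  # typically 30 and 300

# (d) Each point belongs to the cluster of its nearest centroid,
# i.e. the clusters are the Voronoi cells generated by the centroids.
sq_dists = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
nearest = sq_dists.argmin(axis=1)
print("fraction assigned to nearest centroid:", (nearest == km.labels_).mean())
```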

  6. As data in cluster analysis normally don't come with the clusters known, it is very hard to check the assumptions automatically and formally. In particular, if you run k-means and indeed find clusters that are not of similar sizes or not spherical, this doesn't necessarily mean that anything has gone wrong, for the reasons stated above.

  7. It is important to make sure that the k-means objective function makes sense for the specific clustering problem at hand (aim and use of the clustering).

  8. The best data-based diagnoses are in my view visual: scatterplots, maybe with dimension reduction or rotation. k-means is not good if cluster structures of interest can then be seen that disagree with the found clusters. Depending on the clustering aim, it may also not be good if the found clusters turn out not to be separated from each other (a lower number of clusters may be advisable in such a case, or no clustering at all).
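
A minimal sketch of the kind of visual diagnosis meant in item 8, assuming scikit-learn and matplotlib are available (PCA is just one possible dimension reduction):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def kmeans_scatter_diagnostic(X, n_clusters):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    Z = PCA(n_components=2).fit_transform(X)  # dimension reduction for plotting
    plt.scatter(Z[:, 0], Z[:, 1], c=labels, s=10)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title(f"k-means with k={n_clusters}, first two principal components")
    plt.show()
```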

  9. As k-means doesn't scale variables (related to "sphericity"), it is not advisable with variables that have different and unrelated measurement units, although it may be fine after the variables have been standardised.
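
A minimal sketch of the standardisation mentioned in item 9, assuming scikit-learn (the choice k=3 is only a placeholder):

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_kmeans = make_pipeline(
    StandardScaler(),                                # put each variable on unit variance
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
# labels = scaled_kmeans.fit_predict(X)  # X: your data matrix
```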

PS: Let me add comments on two issues.

(a) Some say that "a key assumption of k-means is that the number of clusters has to be known." Not really. There are several methods in the literature that allow one to fit k-means for different numbers of clusters and then decide which one is best, either by optimising a validity criterion or by looking for a small number of clusters after which the objective function is no longer strongly (or significantly) improved. Look for the Average Silhouette Width, the Calinski & Harabasz criterion, the gap statistic, the Sugar and James approach, prediction strength, bootstrap stability etc. Now it is true that these different approaches do not agree in many situations. But the number of clusters problem ultimately comes with the same issues as the clustering problem more generally, namely that there is no uniquely defined true number of clusters, and an appropriate number of clusters cannot be objectively decided from the data alone but requires knowledge of the aim and use of the clustering. The data cannot decide, for example, how compact clusters ultimately have to be. There are methods that come with a built-in decision of the number of clusters, but what I wrote before applies to them as well. Many users like an automatic decision of the number of clusters because they don't feel confident making such decisions themselves, but this basically means that the user hands control over this decision to some usually not well understood algorithm, without any guarantee that the algorithm does something appropriate for the application at hand (and if required, such decision rules are also available for k-means).
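
A minimal sketch of two of the criteria mentioned above (Average Silhouette Width and the Calinski & Harabasz criterion), assuming scikit-learn; the criteria need not agree, and the final choice should still depend on the aim and use of the clustering:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

def compare_numbers_of_clusters(X, k_values=range(2, 9)):
    # Fit k-means for each candidate k and print the two validity criteria.
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(f"k={k}",
              f"ASW={silhouette_score(X, labels):.3f}",
              f"CH={calinski_harabasz_score(X, labels):.1f}")
```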

(b) Here is an older thread discussing k-means: How to understand the drawbacks of K-means. It focuses strongly on the "drawbacks" of k-means, and there are some good examples that help one understand what k-means actually does and doesn't do. As such, that thread is very useful. However, I have an issue with pretty much all the answers there, which is that they seem to implicitly assume that it is clear in any given situation what "the true clusters" are, but it isn't. No formal definition is given; people just show pictures and say "what k-means does is obviously wrong here", appealing to some supposedly general human intuition about what the true clusters should be. Here is an example (taken from David Robinson's answer in that thread):

[Figure from David Robinson's answer in the "drawbacks" thread: points on an outer circle around an inner point cloud, with the counterintuitive 2-means clustering.]

So k-means gives a counterintuitive solution here, and the writer claims that the correct solution should be to have one cluster for the points around the outer circle and another one for the inner point cloud. But no definition of a clustering has been given according to which this is the "truth". There is no scientific basis for this. It is true that such a solution is desired in certain applications, particularly when mimicking human pattern recognition. But this is not what clustering is always about. In particular, in some applications large within-cluster distances are not acceptable (for example when clustering is used to represent small compact data subsets by their centroid for efficient information reduction), and then the supposedly "correct" clustering doesn't work at all (although the 2-means solution is arguably not much better, and a larger number of clusters should be fitted that respects the separation between the outer circle and the inner point cloud, which can well be done with k-means). The bottom line is that k-means has a certain behaviour that may not be appropriate in a given situation, but this doesn't mean it's "wrong" in any objective sense.
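
A minimal sketch of that last point, assuming scikit-learn (the ring-plus-inner-circle data from make_circles stands in for the figure's example; the specific values of k and the data parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Outer ring (label 0) around a much smaller inner circle (label 1).
X, ring = make_circles(n_samples=500, factor=0.2, noise=0.05, random_state=0)

for k in (2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Count clusters that contain both outer-ring and inner points; with k=2
    # this typically happens (the ring is cut in half), with a larger k the
    # clusters are compact and typically respect the separation.
    mixed = sum(len(set(ring[labels == j])) > 1 for j in range(k))
    print(f"k={k}: {mixed} of {k} clusters mix outer-ring and inner points")
```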
