Solved – Is it true that K-Means has an assumption “each cluster has a roughly equal number of observations”

assumptionsclusteringdata miningk-meansmachine learning

A lecturer claimed in a recent class that "K-means assumes that each cluster includes a roughly equal number of observations."

However, when I searched online, there is conflicting information regarding this point.

The answer to this question claims that the "size" in K-Means refers to the area, not cardinality.

The answer to this question, however, explicitly claims that "size" in K-Means refers to "the amount of points in a cluster", not the "spread"!

This highly upvoted question also didn't help. The question included a statement that K-Means has the following assumption:

the prior probability for all k clusters is the same, i.e., each cluster has roughly equal number of observations

Unfortunately, the answers to that question didn't seem to directly address that "assumption" at all. So I still don't know if it's true or false.

Wikipedia says

k-means clustering and EM clustering on an artificial dataset ("mouse"). The tendency of k-means to produce equal-sized clusters leads to bad results, while EM benefits from the Gaussian distribution present in the data set

which further adds to the confusion.

The answers to the first two questions seem to directly contradict each other. So one of them must be false/inaccurate?

What exactly is K-Means' relationship with "equal number of observations in each cluster"? Is it an assumption? A tendency in the result? Or neither?

Best Answer

There is certainly no assumption in standard K-means algorithms that assumes an equal number of points in each cluster. However, certain standard algorithms do have a tendency towards equalising the spatial variance of clusters, which can result in a (rough) tendency towards equality of cluster sizes in cases where there is overlap between the clusters. For example, one standard method is to estimate the clusters by minimising the within-cluster sum-of-squares (WCSS). In cases where there are several overlapping clusters, this method has a tendency to allocate points in a way that (roughly) equalises the spacial variance of the clusters, which may result in (rough) equalisation of the number of points in each cluster. Alternative methods that use parametric forms to allow greater freedom of variance in each cluster will lack this tendency.