Solved – Why not terminate the k-means clustering algorithm after one iteration

clustering, hierarchical clustering, k-means

Does anybody know whether there are applications of the k-means algorithm with only one iteration? (Of course, you may feel inclined to not call it k-means anymore in that case.)

There is a clear motivation for stopping after one iteration: the cluster centroids remain easy to interpret, because you can simply choose the initial seeds to be interpretable points.

There are clear caveats as well: the resulting cluster centroids, which are simply the values you chose initially (the "seeds"), will not be the geometric centers of the clusters. Relatedly, the results may be bad with regard to the "k-means objective function", which aims at placing centroids in the space such that the average distance of data points to their centroids is minimal.
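To make the setup concrete, here is a minimal sketch of the scheme described above (hand-picked seeds, one assignment pass, then stop); the data and seed values are purely hypothetical. It also exhibits the first caveat: the seeds are generally not the geometric centers of the clusters they induce.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # toy data, purely illustrative

# Hand-picked, interpretable seeds (hypothetical values).
seeds = np.array([[-1.0, -1.0],
                  [ 1.0,  1.0]])

# "One iteration": assign every point to its nearest seed, then stop.
labels = cdist(X, seeds).argmin(axis=1)

# The reported "centroids" are the seeds themselves, which in general
# differ from the geometric centers of the induced clusters:
centers = np.array([X[labels == k].mean(axis=0) for k in range(len(seeds))])
print("seeds:\n", seeds)
print("geometric centers:\n", centers)
```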

I understand that for certain applications, these caveats are severe. If I have $n$ cities (with equal populations) in the plane and I must decide where to put the $k < n$ rescue helicopter stations, I may want to place them so that the average distance between a city and its nearest rescue helicopter station is minimal, so that the average time needed to fly to a patient is minimal too. That makes sense.

However, if it is just about data analysis, important competitors to k-means are the so-called hierarchical clustering procedures, which produce clusterings that need not have any of the optimality properties of k-means. In particular, a hierarchical clustering may lead to very bad values of the "k-means objective function" (assuming that in the hierarchically generated clusters one would define the centroids ex post as their geometric centers), as the quick comparison sketched below illustrates.
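One quick way to check this claim in practice is to compare the k-means objective (within-cluster sum of squared distances to ex-post geometric centers) of a hierarchical clustering with that of an actual k-means run. A sketch, assuming scipy and scikit-learn are available:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
k = 4

def wcss(X, labels):
    """k-means objective: within-cluster sum of squared distances."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

# Hierarchical clustering (single linkage), centroids defined ex post.
hier = fcluster(linkage(X, method="single"), t=k, criterion="maxclust")
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

print("single-linkage WCSS:", wcss(X, hier))
print("k-means WCSS:      ", km.inertia_)
```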

Hence, what is bad about taking the k-means clusters after the first iteration? Compared to hierarchical clustering, this has the advantages that (a) the clusters have centroids, yielding a nice way to interpret the clusters in terms of Euclidean distance, and (b) one does not have to worry about the relative numbers of variables, clusters, and data points, since the sample is partitioned regardless of those values and each data point is uniquely assigned to exactly one cluster. Does anybody know of practical studies where k-means was stopped after one iteration?

Best Answer

What you are discussing sounds more like nearest neighbor classification to me, i.e. assign every point to the nearest "training" object. Essentially, you are describing classification, not clustering.

You appear to be doing only "half an iteration". A full iteration would also update the cluster centroids. The task of clustering is to discover structure in the data; in the case of k-means, that structure consists of the centroids, while in other algorithms it is usually the set of points that makes up each cluster. A minimal sketch of a full iteration follows below.
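To make the distinction concrete, here is a small numpy sketch of one full Lloyd iteration: the assignment step (the questioner's "half iteration") followed by the centroid update that turns it into an actual k-means step.

```python
import numpy as np
from scipy.spatial.distance import cdist

def lloyd_iteration(X, centroids):
    """One full k-means (Lloyd) iteration: assign, then update."""
    # Step 1: the "half iteration" from the question, i.e. nearest-
    # centroid assignment of every data point.
    labels = cdist(X, centroids).argmin(axis=1)
    # Step 2: the part the question omits. Move each centroid to the
    # geometric center of the points assigned to it (keep it in place
    # if its cluster happens to be empty).
    updated = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, updated
```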

Now to the "helipad" example. K-means does not minimize average distances; it minimizes the average squared distance. For a purely cost-driven approach that is worse, and centroid linkage may be a better fit. For the helipad emergency it is preferable to have fewer extreme distances, and one may even want to optimize the maximum distance instead. That is easier to formulate with hierarchical clustering, e.g. using minimax linkage. Furthermore, k-means may place the centers in the ocean, or in a lake (this does not happen with minimax linkage). If you had to connect all points with a point-to-point radio network, single linkage is meaningful, because it corresponds to the minimum spanning tree. And Ward linkage optimizes squared deviations, just like k-means.

This shows a key benefit of hierarchical clustering: its flexibility. It allows both different objectives (linkages) and arbitrary distance functions (e.g. haversine distance); k-means supports neither. A small sketch of this flexibility follows below.
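As a rough illustration of that flexibility, the sketch below runs scipy's hierarchical clustering with several linkages on a precomputed haversine distance matrix. The city coordinates are made up, and since minimax linkage is not available in scipy, complete linkage serves here as a crude stand-in that controls large within-cluster distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def haversine(p, q, r=6371.0):
    """Great-circle distance in km between two (lat, lon) points in radians."""
    dlat, dlon = q[0] - p[0], q[1] - p[1]
    a = np.sin(dlat / 2) ** 2 + np.cos(p[0]) * np.cos(q[0]) * np.sin(dlon / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# Hypothetical city coordinates (lat, lon), converted to radians.
cities = np.radians([[48.1, 11.6], [52.5, 13.4], [50.1, 8.7], [53.6, 10.0]])
n = len(cities)

# Condensed pairwise distance vector; any metric works here.
dvec = np.array([haversine(cities[i], cities[j])
                 for i in range(n) for j in range(i + 1, n)])

# Different linkages encode different objectives: 'single' corresponds to
# the minimum spanning tree, 'complete' limits the largest within-cluster
# distance (a rough surrogate for minimax linkage).
for method in ("single", "complete", "average"):
    labels = fcluster(linkage(dvec, method=method), t=2, criterion="maxclust")
    print(method, labels)
```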