Solved – Can sub-optimality of various hierarchical clustering methods be assessed or ranked

algorithms, clustering, hierarchical clustering, optimization

Classic agglomerative hierarchical clustering methods are based on a greedy algorithm. This means that many of them are prone to give sub-optimal solutions rather than the globally optimal result, especially on later steps of agglomeration. To clarify: at a given step $q$, each agglomerative method makes the locally best choice of which two clusters to merge, i.e. the choice that minimizes the coefficient of colligation $\delta$ on that step (to value $\delta_q$); however, it is possible that, had the algorithm made a slightly worse choice on some previous step(s), it would then have been able to drive the coefficient on step $q$ below that $\delta_q$ while preserving the monotonic growth of $\delta$. The globally optimal solution at step $q$ corresponds to the smallest value $\delta_q$ achievable in this way, under the remaining restriction that $\delta_q \ge \delta_{q-1} \ge \delta_{q-2} \ge \cdots$.
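To make the greedy step concrete, here is a minimal sketch (my own illustration, not from any package), using Ward's increase in pooled within-cluster sum-of-squares as the colligation coefficient $\delta$; the brute-force pairwise loops are only for clarity, not efficiency:

```python
import numpy as np

def within_ss(points):
    """Within-cluster sum of squares of a single cluster."""
    return ((points - points.mean(axis=0)) ** 2).sum()

def greedy_ward(X):
    """Naive greedy agglomeration under Ward's criterion.

    At every step the pair of clusters whose merge causes the smallest
    increase (delta) in pooled within-cluster SS is merged; earlier merges
    are never revisited, which is the source of possible sub-optimality.
    """
    clusters = [[i] for i in range(len(X))]   # start from singletons
    deltas = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = clusters[i] + clusters[j]
                # delta = increase in pooled within-cluster SS caused by this merge
                d = (within_ss(X[merged])
                     - within_ss(X[clusters[i]])
                     - within_ss(X[clusters[j]]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
        deltas.append(d)        # delta_q for this step; non-decreasing for Ward
    return deltas

# e.g. greedy_ward(np.random.rand(20, 2)) returns the sequence of delta_q values
```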

Risk of sub-optimality is the main reason why hierarchical clustering is commonly not recommended with a large number of objects (say, more than several hundred): the researcher typically wants few clusters and therefore looks at the later steps, but a later step (say, the 1000th) is more suspect of having strayed from the globally optimal partition possible at that step than, say, the 100th step is of having strayed from the globally optimal partition possible at the 100th step. This seems to be accepted as a dictum.

My question is: which of the well-known agglomerative methods – single linkage, complete linkage, average linkage (within groups), average linkage (between groups), centroid, median, Ward's sum-of-squares (I'm mentioning those in SPSS, but there are other similar variants as well) – do you think are more, and which less, prone to this sub-optimality flaw as the step number grows? Intuition suggests to me that single and complete linkage are not prone at all and always give their globally best solutions, because these methods don't compute any statistics derived from the raw distances (e.g. centroids). I may be right or I may be wrong. What about the rest of the methods? Can someone here attempt to analyze the relative degree of local-optimum risk of the above concrete greedy algorithms?

Illustration. Fortunately, for the Ward method we have a clue for observing how sub-optimality accumulates as $q$ grows, because there is a well-known iterative method that attempts to optimize the same function $\delta$ as Ward does: K-means clustering. Both try to minimize the pooled within-cluster sum-of-squares $^1$. We can do Ward clustering, save the cluster centres at each step, and use those as initial centres in a K-means procedure (a sketch of this procedure is given after the note below). Will K-means improve on Ward's solutions in terms of sum-of-squares?

[$^1$ Footnote: Ward minimizes the increase in the sum-of-squares at each of its steps. Another, less well-known hierarchical method, the variance method MNVAR (see Podani, 1989, for various hierarchical methods), minimizes the mean-of-squares within clusters. Both methods can be considered akin to K-means, with Ward's being a somewhat better minimizer of the final SSwithin than MNVAR (my own simulation studies).]

Note. This illustration does not test Ward's observed $\delta_q$ values against Ward's own globally optimal values, which we cannot know because we are unable to try, for each step, all the innumerable alternative reconstructions of earlier steps in order to find the minimal possible $\delta_q$; it tests them against K-means optimality (which is nearly the global optimum here, because the initial centres provided by Ward can be regarded as a reasonably good start for the iterations).
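A sketch of the procedure, under the assumption that SciPy's Ward linkage and scikit-learn's KMeans are acceptable stand-ins for the SPSS runs; the data here are synthetic (make_blobs), not the 183-point set shown further below:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def pooled_within_ss(X, labels):
    """Pooled within-cluster sum of squares of a hard partition."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

# synthetic stand-in for the 183-point, 5-cluster data set
X, _ = make_blobs(n_samples=183, centers=5, cluster_std=2.0, random_state=0)
Z = linkage(X, method='ward')                     # full Ward hierarchy

for k in range(50, 1, -1):                        # 50 down to 2 clusters
    labels_w = fcluster(Z, t=k, criterion='maxclust')
    ids = np.unique(labels_w)
    centres = np.vstack([X[labels_w == c].mean(axis=0) for c in ids])
    # K-means seeded with the Ward cluster centres of this step
    km = KMeans(n_clusters=len(ids), init=centres, n_init=1).fit(X)
    print(f"k={len(ids):2d}  Ward SS={pooled_within_ss(X, labels_w):9.2f}  "
          f"K-means SS={km.inertia_:9.2f}")
```

Whenever the printed K-means SS is below the Ward SS for the same step, K-means has improved on Ward's partition, which is exactly the comparison plotted below.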

Here are some data (5 poorly separated clusters, 183 points). The data underwent both a Ward clustering session and the K-means clustering sessions described above.

[Figure: scatterplot of the 183 data points]

Results (the value of $\delta$) for 50 through 2 clusters are shown below. Although the two curves are close to each other, the K-means values are somewhat better (lower). That is, K-means is closer to optimal than Ward.

[Figure: $\delta$ for Ward and K-means, 50 through 2 clusters]

The last picture plots $\delta^{Ward}/\delta^{Kmeans}$. The trend is clearly seen: as the number of clusters diminishes (that is, step $q$ for Ward grows), Ward tends to become less optimal relative to K-means.

[Figure: the ratio $\delta^{Ward}/\delta^{Kmeans}$ as the number of clusters diminishes]

One tempting question is whether the relative sub-optimality of Ward (the value of $\delta^{Ward}/\delta^{Kmeans}$) on these last 49 steps of agglomeration would be higher if I analyzed, in the same fashion, 1830 data points instead of just 183.

Best Answer

Only single-linkage is optimal. Complete-linkage fails, so your intuition was wrong there. ;-)

As a simple example, consider the one-dimensional data set 1,2,3,4,5,6.

The (unique) optimal solution for complete linkage with two clusters is (1,2,3) and (4,5,6) (complete-linkage height 2). The (unique) optimal solution with three clusters is (1,2), (3,4), (5,6), at linkage height 1. Obviously, no agglomerative hierarchy can contain both, because these partitions do not nest. Hence, no hierarchical clustering contains all optimal solutions for complete linkage. Based on that example, I would even expect complete linkage to exhibit one of the worst gaps between the optimum solution and the agglomerative solution at high levels... And this example is likely a counter-example for most other schemes as well (except single linkage, where this data set degenerates).
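For what it's worth, this is easy to check numerically. A sketch assuming SciPy; the exact merge order among the many ties at distance 1 may vary, but whatever hierarchy comes out, at least one of the two cuts must miss its optimum:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.], [2.], [3.], [4.], [5.], [6.]])
Z = linkage(X, method='complete')                # greedy complete-linkage hierarchy

def height(X, labels):
    """Complete-linkage height of a partition: the largest cluster diameter."""
    return max(float(np.ptp(X[labels == c]))     # ptp = max - min, i.e. 1-D diameter
               for c in np.unique(labels))

for k, optimal in [(3, 1.0), (2, 2.0)]:          # optimal heights from the text above
    labels = fcluster(Z, t=k, criterion='maxclust')
    print(f"k={k}: labels={labels}, height={height(X, labels)}, optimum={optimal}")
# Depending on how the ties at distance 1 are broken, one of the two cuts
# matches its optimum and the other does not; no single dendrogram can match both.
```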

You may be interested in studying this article:

Gunnar E. Carlsson, Facundo Mémoli: Characterization, Stability and Convergence of Hierarchical Clustering Methods. Journal of Machine Learning Research 11: 1425-1470 (2010)

and the references contained therein, in particular also:

Kleinberg, Jon M. "An impossibility theorem for clustering." Advances in neural information processing systems. 2003.

These seem to indicate that, given some mild stability requirements (which can be interpreted as uncertainty in our distance measurements), only single-linkage clustering remains. Which, unfortunately, is hardly usable in most cases... so IMHO, for practical purposes, the authors overshoot a bit, neglecting the usefulness of the result in favor of pretty theory.