Solved – Purpose of dendrogram and hierarchical clustering

clustering, dendrogram, hierarchical clustering

This is likely a very naive question. I've lately been reading about hierarchical clustering algorithms, and various discussions about how to interpret dendrograms or find optimal heights for cutting a dendrogram. I've also played around with some hierarchical clustering software in Python.

My question is: what exactly can a dendrogram tell you that you couldn't find out from a simple distance matrix?

Firstly, computing a dendrogram is very costly. Most linkage methods (such as group average linkage) have an algorithmic complexity of O(N^2 log N), which, as any comp-sci student can tell you, is pretty horrible in terms of scalability. Even a dataset with only around 1,000 datapoints could be relatively slow on modern computing hardware.

Secondly, it seems that once you actually have the dendrogram, it's not always clear exactly how to cut it. However, generally you want to try and find clusters with lower distance values. Intuitively, clusters at heights with very high distance values are likely not very significant.

So with all that said, what exactly can a dendrogram tell you that you couldn't ascertain with a simple distance matrix? By that I mean, instead of constructing a dendrogram, simply compute an NxN distance matrix in O(N^2) time, then iterate over the matrix and find distance values which fall within some predefined range or threshold. This will give you a list of all points which are "close" to each other, and produce a single cluster of likely-related objects.

So, my question is, what can a dendrogram tell you that may potentially be more informative, significant or useful than a simple distance matrix?

Best Answer

What threshold would you use on the matrix approach?

That is exactly what the height is for single-linkage (the other linkages are much more interesting): a distance threshold.

The hierarchical clustering algorithms are working on the distance matrix just as you suggest, and looking exactly for such distance patterns. But the matrix has size O(n^2) (which is pretty horrible, and most tools will run out of memory at around 50000 to 65535 objects) and if you only need to pass over the matrix log n times, then you have O(n^2 log n) complexity.

Now if you don't know the distance threshold yet - that is where the dendrogram gets nice. A dendrogram only has O(n) values. It's a condensed visual representation of the distances to help you choose the threshold (and there are also approaches that use multiple thresholds, or different thresholds in different parts of the data set!). You certainly won't want to present the O(n^2) distance matrix to the user and ask them to pick the threshold based on that?!
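To make the "height is a threshold" point concrete, here is a minimal pure-Python sketch for the single-linkage case only. The function names and the toy points are made up for illustration, and the Kruskal-style construction is just one simple way to get the merge heights, not what any particular package does. It shows that the n-1 merge heights already answer "how many clusters do I get at threshold t?" for every t, without rescanning the O(n^2) matrix.

```python
from itertools import combinations
from math import dist


def find(parent, i):
    """Return the representative of i's set, compressing the path."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def single_linkage_heights(points):
    """Return the n-1 merge heights of a single-linkage dendrogram.

    All pairwise distances are sorted and components are merged
    Kruskal-style, so each recorded height is the distance at which
    two clusters first join.
    """
    parent = list(range(len(points)))
    pairs = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    heights = []
    for d, i, j in pairs:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj
            heights.append(d)
    return heights


def threshold_cluster_count(points, t):
    """Count clusters obtained by linking every pair with distance <= t."""
    parent = list(range(len(points)))
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= t:
            parent[find(parent, i)] = find(parent, j)
    return len({find(parent, i) for i in range(len(points))})


# Hypothetical toy data: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (20.0, 0.0)]
heights = single_linkage_heights(points)

for t in (0.5, 1.2, 2.0, 6.0, 16.0):
    from_matrix = threshold_cluster_count(points, t)
    from_dendrogram = len(points) - sum(h <= t for h in heights)
    print(t, from_matrix, from_dendrogram)
```

Both columns agree for every threshold, but the second one is read off a list of only n-1 numbers, which is the information a dendrogram draws.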

P.S. Single-linkage can be done in O(n^2), with less memory than a distance matrix. I believe it can even be done in O(n log n) under certain prerequisites.

Solved – Choosing the right linkage method for hierarchical clustering

Methods overview

Short reference about some linkage methods of hierarchical agglomerative cluster analysis (HAC).

The basic version of the HAC algorithm is generic: at each step it updates, by the formula known as the Lance-Williams formula, the proximities between the newly emerged cluster (merged from two) and all the other clusters (including singleton objects) existing so far. There exist implementations that do not use the Lance-Williams formula, but using it is convenient: it lets one code various linkage methods from the same template.

The recurrence formula includes several parameters (alpha, beta, gamma). Depending on the linkage method, the parameters are set differently, and so the unwrapped formula takes a method-specific form. Many texts on HAC show the formula and its method-specific forms and explain the methods. I would recommend the articles by Janos Podani as very thorough.
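For reference, the recurrence is usually written as follows (the notation is the commonly published one, not something fixed by this answer: $i$ and $j$ are the two clusters being merged, $k$ is any other cluster, $d$ is the proximity):

$$d(k,\, i \cup j) = \alpha_i\, d(k,i) + \alpha_j\, d(k,j) + \beta\, d(i,j) + \gamma\, \lvert d(k,i) - d(k,j) \rvert$$

As a quick check of how the parameters select a method: with $\alpha_i = \alpha_j = 1/2$, $\beta = 0$, $\gamma = -1/2$ the right-hand side becomes $\tfrac{1}{2}\left[d(k,i) + d(k,j)\right] - \tfrac{1}{2}\lvert d(k,i) - d(k,j)\rvert = \min\{d(k,i),\, d(k,j)\}$, which is single linkage; flipping the sign to $\gamma = +1/2$ gives $\max\{d(k,i),\, d(k,j)\}$, which is complete linkage.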

The room and the need for different methods arise from the fact that a proximity (distance or similarity) between two clusters, or between a cluster and a singleton object, can be formulated in many different ways. HAC merges the two closest clusters or points at each step, but how to compute that proximity, given that the input proximity matrix was defined between singleton objects only, is the problem to be formulated.

So, the methods differ in respect to how they define proximity between any two clusters at every step. "Colligation coefficient" (output in agglomeration schedule/history and forming the "Y" axis on a dendrogram) is just the proximity between the two clusters merged at a given step.

Method of single linkage or nearest neighbour. Proximity between two clusters is the proximity between their two closest objects. This value is one of the values of the input matrix. The conceptual metaphor of this build of cluster, its archetype, is a spectrum or chain. Chains could be straight or curvilinear, or could look like a "snowflake" or an "amoeba". The two most dissimilar cluster members can happen to be very dissimilar in comparison to the two most similar ones. The single linkage method controls only nearest-neighbour similarity.
Method of complete linkage or farthest neighbour. Proximity between two clusters is the proximity between their two most distant objects. This value is one of the values of the input matrix. The metaphor of this build of cluster is a circle (in the sense of a circle of people sharing a hobby or a plot), where the two members most distant from each other cannot be much more dissimilar than other quite dissimilar pairs. Such clusters have "compact" outlines at their borders, but they are not necessarily compact inside.
Method of between-group average linkage (UPGMA). Proximity between two clusters is the arithmetic mean of all the proximities between the objects of one, on one side, and the objects of the other, on the other side. The metaphor of this build of cluster is quite generic, just a united class or close-knit collective; and the method is frequently set as the default one in hierarchical clustering packages. Clusters of miscellaneous shapes and outlines can be produced.
Simple average, or method of equilibrious between-group average linkage (WPGMA), is a modification of the previous method. Proximity between two clusters is the arithmetic mean of all the proximities between the objects of one, on one side, and the objects of the other, on the other side; while the subclusters from which each of these two clusters was recently merged have equalized influence on that proximity – even if the subclusters differed in the number of objects.
Method of within-group average linkage (MNDIS). Proximity between two clusters is the arithmetic mean of all the proximities in their joint cluster. This method is an alternative to UPGMA. It will usually lose to it in terms of cluster density, but will sometimes uncover cluster shapes which UPGMA will not.
Centroid method (UPGMC). Proximity between two clusters is the proximity between their geometric centroids: the [squared] euclidean distance between those. The metaphor of this build of cluster is proximity of platforms (politics). As in political parties, such clusters can have fractions or "factions", but unless their central figures are apart from each other the union is consistent. Clusters can be various by outline.
Median, or equilibrious centroid method (WPGMC), is a modification of the previous method. Proximity between two clusters is the proximity between their geometric centroids ([squared] euclidean distance between those); while the centroids are defined so that the subclusters from which each of these two clusters was recently merged have equalized influence on its centroid – even if the subclusters differed in the number of objects. The name "median" is partly misleading because the method doesn't use medians of data distributions; it is still based on centroids (the means).
Ward’s method, or minimal increase of sum-of-squares (MISSQ), sometimes incorrectly called "minimum variance" method. Proximity between two clusters is the magnitude by which the summed square in their joint cluster will be greater than the combined summed square in these two clusters: $SS_{12}-(SS_1+SS_2)$. (Between two singleton objects this quantity = squared euclidean distance / $2$.) The metaphor of this build of cluster is a type. Intuitively, a type is a cloud that is denser and more concentric towards its middle, whereas marginal points are few and could be scattered relatively freely.
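The "same template for many linkage methods" idea can be sketched in a few lines of Python. This is only an illustration, not the code of any particular package: the function name `hac`, the `COEFFS` table and the input layout (a full symmetric matrix as nested lists) are assumptions made here, and the coefficient values are the commonly published Lance-Williams ones rather than something stated in this answer. The centroid, median and Ward rows assume squared euclidean input. MNDIS is left out because it needs an extra within-cluster statistic.

```python
from itertools import combinations

# Update rule applied after merging clusters i and j, for every other cluster k:
#   d(k, i+j) = a_i*d(k,i) + a_j*d(k,j) + b*d(i,j) + g*|d(k,i) - d(k,j)|
# Each entry maps the cluster sizes (n_i, n_j, n_k) to (a_i, a_j, b, g).
COEFFS = {
    "single":   lambda ni, nj, nk: (0.5, 0.5, 0.0, -0.5),
    "complete": lambda ni, nj, nk: (0.5, 0.5, 0.0, 0.5),
    "upgma":    lambda ni, nj, nk: (ni / (ni + nj), nj / (ni + nj), 0.0, 0.0),
    "wpgma":    lambda ni, nj, nk: (0.5, 0.5, 0.0, 0.0),
    "centroid": lambda ni, nj, nk: (ni / (ni + nj), nj / (ni + nj),
                                    -ni * nj / (ni + nj) ** 2, 0.0),
    "median":   lambda ni, nj, nk: (0.5, 0.5, -0.25, 0.0),
    "ward":     lambda ni, nj, nk: ((ni + nk) / (ni + nj + nk),
                                    (nj + nk) / (ni + nj + nk),
                                    -nk / (ni + nj + nk), 0.0),
}


def hac(dist_matrix, method):
    """Return the agglomeration schedule as (cluster_a, cluster_b, height).

    Original objects are numbered 0..n-1; each merge creates a new
    cluster id n, n+1, ... in order of creation.
    """
    n = len(dist_matrix)
    size = {i: 1 for i in range(n)}
    d = {frozenset((i, j)): dist_matrix[i][j]
         for i, j in combinations(range(n), 2)}
    schedule = []
    next_id = n
    while len(size) > 1:
        pair = min(d, key=d.get)          # closest pair of active clusters
        i, j = sorted(pair)
        dij = d.pop(pair)
        ni, nj = size.pop(i), size.pop(j)
        for k in size:                    # update proximities to the new cluster
            dki = d.pop(frozenset((k, i)))
            dkj = d.pop(frozenset((k, j)))
            ai, aj, b, g = COEFFS[method](ni, nj, size[k])
            d[frozenset((k, next_id))] = (
                ai * dki + aj * dkj + b * dij + g * abs(dki - dkj)
            )
        size[next_id] = ni + nj
        schedule.append((i, j, dij))      # dij is the colligation coefficient
        next_id += 1
    return schedule


pts = [0.0, 1.0, 3.0, 7.0]                       # four points on a line
D = [[abs(p - q) for q in pts] for p in pts]     # plain (nonsquared) distances
for method in ("single", "complete", "upgma"):
    print(method, [round(h, 3) for _, _, h in hac(D, method)])
```

Only the `COEFFS` row changes between methods; the merging loop is shared. Feeding the "ward" row with squared euclidean distances divided by 2 should reproduce the sum-of-squares increase $SS_{12}-(SS_1+SS_2)$ as the merge height, since the recurrence is linear in the input proximities.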

Some of the less well-known methods (see Podani J. New combinatorial clustering methods // Vegetatio, 1989, 81: 61-77.) [also implemented by me as a SPSS macro found on my web-page]:

Method of minimal sum-of-squares (MNSSQ). Proximity between two clusters is the summed square in their joint cluster: $SS_{12}$. (Between two singleton objects this quantity = squared euclidean distance / $2$.)
Method of minimal increase of variance (MIVAR). Proximity between two clusters is the magnitude by which the mean square in their joint cluster will be greater than the weightedly (by the number of objects) averaged mean square in these two clusters: $MS_{12}-(n_1MS_1+n_2MS_2)/(n_1+n_2) = [SS_{12}-(SS_1+SS_2)]/(n_1+n_2)$. (Between two singleton objects this quantity = squared euclidean distance / $4$.)
Method of minimal variance (MNVAR). Proximity between two clusters is the mean square in their joint cluster: $MS_{12} = SS_{12}/(n_1+n_2)$. (Between two singleton objects this quantity = squared euclidean distance / $4$.)
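The singleton values quoted in parentheses for the sum-of-squares family can be checked numerically. The sketch below uses hypothetical helper names and computes the four quantities directly from their definitions (with "mean square" taken as $SS/n$, as in the formulas above) for two singleton objects whose squared euclidean distance is 25.

```python
def sum_of_squares(cluster):
    """Sum of squared deviations of the points from their centroid."""
    n = len(cluster)
    dims = len(cluster[0])
    centroid = [sum(p[a] for p in cluster) / n for a in range(dims)]
    return sum((p[a] - centroid[a]) ** 2 for p in cluster for a in range(dims))


def proximities(c1, c2):
    """Proximity of two clusters under Ward (MISSQ), MNSSQ, MIVAR and MNVAR."""
    joint = c1 + c2
    ss1, ss2, ss12 = sum_of_squares(c1), sum_of_squares(c2), sum_of_squares(joint)
    n = len(joint)
    return {
        "ward": ss12 - (ss1 + ss2),            # increase of sum-of-squares
        "mnssq": ss12,                         # sum-of-squares of the joint cluster
        "mivar": (ss12 - (ss1 + ss2)) / n,     # increase of mean square
        "mnvar": ss12 / n,                     # mean square of the joint cluster
    }


a, b = [(0.0, 0.0)], [(3.0, 4.0)]   # squared euclidean distance = 3^2 + 4^2 = 25
print(proximities(a, b))
```

The first two values come out as 25/2 and the last two as 25/4, matching the parenthetical remarks for Ward, MNSSQ, MIVAR and MNVAR.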

Still other methods represent some specialized set distances. A HAC algorithm can be based on them, only not on the generic Lance-Williams formula; such distances include, among others, the Hausdorff distance and the point-centroid cross-distance (I've implemented a HAC program for SPSS based on those).

The first 5 methods described permit any proximity measure (any similarity or distance), and the results will, naturally, depend on the measure chosen.

The next 6 methods described require distances, and to be fully correct one should use only squared euclidean distances with them, because these methods compute centroids in euclidean space. Therefore the distances should be euclidean for the sake of geometric correctness (these 6 methods are together called geometric linkage methods). In the worst case, you might input other metric distances, accepting a more heuristic, less rigorous analysis. Now about that "squared". Computing centroids and deviations from them is most convenient, mathematically and programmatically, on squared distances; that is why HAC packages usually require squared distances as input and are tuned to process them. However, there exist implementations - fully equivalent yet a bit slower - that are based on nonsquared distances and require those as input; see for example the "Ward-2" implementation of Ward's method. You should consult the documentation of your clustering program to know which distances - squared or not - it expects as input to a "geometric method" in order to do it right.
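A tiny numeric check illustrates why the "squared" matters. The sketch below assumes the commonly published Lance-Williams coefficients for the centroid method (they are not given in this answer) and uses a made-up triangle of points. With squared euclidean input the update reproduces the true squared distance to the new centroid; with nonsquared input the very same formula does not match the geometry.

```python
from math import dist, isclose


def centroid_update(d_ki, d_kj, d_ij, n_i, n_j):
    """Lance-Williams update for the centroid method (UPGMC)."""
    n = n_i + n_j
    return (n_i / n) * d_ki + (n_j / n) * d_kj - (n_i * n_j / n ** 2) * d_ij


a, b, c = (0.0, 0.0), (4.0, 0.0), (2.0, 3.0)
centroid_ab = (2.0, 0.0)                   # mean of a and b
true_distance = dist(c, centroid_ab)       # plain euclidean distance from c

# Squared input: the update returns the squared centroid distance.
from_squared = centroid_update(
    dist(c, a) ** 2, dist(c, b) ** 2, dist(a, b) ** 2, 1, 1
)
# Nonsquared input: the same formula no longer agrees with the geometry.
from_plain = centroid_update(dist(c, a), dist(c, b), dist(a, b), 1, 1)

print(isclose(from_squared, true_distance ** 2))   # squared input is consistent
print(isclose(from_plain, true_distance))          # nonsquared input is not
```

An implementation that accepts nonsquared distances has to use a different update internally, which is why checking the package documentation is worth the time.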

Methods MNDIS, MNSSQ, and MNVAR require, at each step, storing a within-cluster statistic (which depends on the method) in addition to updating by the Lance-Williams formula.

The methods most frequently used in studies where clusters are expected to be solid, more or less round clouds are the average linkage methods, the complete linkage method, and Ward's method.

Ward's method is the closest, by its properties and efficiency, to K-means clustering; they share the same objective function - minimization of the pooled within-cluster SS "in the end". Of course, K-means (being iterative and if provided with decent initial centroids) is usually a better minimizer of it than Ward. However, Ward seems to me a bit more accurate than K-means in uncovering clusters of uneven physical sizes (variances) or clusters thrown about space very irregularly. The MIVAR method is weird to me; I can't imagine when it could be recommended, as it doesn't produce dense enough clusters.

The centroid, median, and minimal increase of variance methods may sometimes give so-called reversals: a phenomenon when the two clusters being merged at some step appear closer to each other than pairs of clusters merged earlier. That is because these methods do not belong to the so-called ultrametric ones. This situation is inconvenient but is theoretically OK.
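A reversal is easy to reproduce by hand with three points forming a nearly equilateral triangle. The sketch below uses made-up coordinates and plain (nonsquared) euclidean distances for readability; squaring them would not change which height is larger. Complete linkage is computed alongside as a method that cannot reverse.

```python
from math import dist

a, b, c = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# Step 1: a and b are the closest pair, so they merge first.
h1 = dist(a, b)
assert h1 < dist(a, c) and h1 < dist(b, c)

# Step 2 (centroid method): distance from c to the centroid of {a, b}.
centroid_ab = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
h2_centroid = dist(c, centroid_ab)

# Step 2 (complete linkage): farthest-neighbour distance from c to {a, b}.
h2_complete = max(dist(a, c), dist(b, c))

print("first merge height:", h1)
print("centroid second merge height:", h2_centroid, "reversal:", h2_centroid < h1)
print("complete second merge height:", h2_complete, "reversal:", h2_complete < h1)
```

Under the centroid method the second merge happens at a lower height than the first, so the dendrogram branch would have to bend back down; under complete linkage the heights are non-decreasing.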

The single linkage and centroid methods belong to the so-called space contracting, or “chaining”, methods. That means, roughly speaking, that they tend to attach objects one by one to clusters, and so they demonstrate a relatively smooth growth of the curve “% of clustered objects”. On the contrary, the complete linkage, Ward’s, sum-of-squares, increase of variance, and variance methods commonly get a considerable share of objects clustered even at early steps, and then proceed to merge those clusters; therefore their curve “% of clustered objects” is steep from the first steps. These methods are called space dilating. Other methods fall in between.

Flexible versions. By adding an additional parameter to the Lance-Williams formula it is possible to make a method self-tuning on its steps. The parameter brings in a correction to the between-cluster proximity being computed, which depends on the size (amount of de-compactness) of the clusters. The meaning of the parameter is that it makes the agglomeration method more space dilating or more space contracting than the standard method is doomed to be. The most well-known implementation of flexibility so far is for the average linkage methods UPGMA and WPGMA (Belbin, L. et al. A Comparison of Two Approaches to Beta-Flexible Clustering // Multivariate Behavioral Research, 1992, 27, 417–433.).

Dendrogram. The "Y" axis of a dendrogram typically displays the proximity between the merging clusters, as defined by the methods above. Therefore, for example, in the centroid method the squared distance is typically gauged (ultimately, it depends on the package and its options) - some researchers are not aware of that. Also, by tradition, with methods based on increment of nondensity, such as Ward’s, what is usually shown on the dendrogram is the cumulative value - for reasons of convenience rather than theory. Thus, (in many packages) the plotted coefficient in Ward’s method represents the overall, across all clusters, within-cluster sum-of-squares observed at the moment of a given step. Be sure to read the documentation of your package to find out in which form the particular program displays the colligation coefficient (cluster distance) on its dendrogram.

One should refrain from judging which linkage method is "better" for one's data by comparing the looks of the dendrograms: not only because the looks change when you change which modification of the coefficient you plot there, as was just described, but also because the look will differ even on data with no clusters.

To choose the "right" method

There is no single criterion. Some guidelines how to go about selecting a method of cluster analysis (including a linkage method in HAC as a particular case) are outlined in this answer and the whole thread therein.

Solved – “Updating” hierarchical clustering

Hierarchical clustering results are not very well updateable.

If the nearest neighbor of a new point (similarly for a disappearing point) is at height h, then you should be able to keep anything below this value.

This is inherent to the design of hierarchical clustering; so if you try to find an approximation for more efficient updating (I believe I have seen such a paper) then your result quality can drop to "completely wrong result" pretty much instantly.

Here's a paper on updating MST (= single linkage clustering) in O(n) when a new node arrives: http://epubs.siam.org/doi/abs/10.1137/0204032 but it won't transfer to other linkages.

Here's a simple proof:

Theorem: Updating hierarchical clustering takes at least O(n) time for linkages with runtime O(n^2) (e.g. single-link) and at least O(n^2) for linkages with runtime O(n^3) (all popular others).

Proof: Otherwise, we would have a more efficient algorithm for hierarchical clustering by repeated insertion of points, which takes O(n*updatecost).

So in the general case, updating will at least be O(n^2) (assuming that the worst case O(n^3) for dendrogram construction is proven).
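Spelled out, the argument goes as follows. The symbols are introduced here only for illustration: $U(k)$ is the cost of inserting one point into a dendrogram on $k$ points, and $T(n)$ is the cost of the best possible construction from scratch. Building by repeated insertion is one valid construction, so

$$T(n) \le \sum_{k=1}^{n-1} U(k) \le n \cdot \max_{k<n} U(k) \quad\Longrightarrow\quad \max_{k<n} U(k) \ge \frac{T(n)}{n}.$$

Hence, if construction really needs on the order of $n^2$ operations, some update must cost on the order of $n$; and if it needs on the order of $n^3$, some update must cost on the order of $n^2$. The conclusion is only as strong as the assumed lower bound on $T(n)$.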

In many cases it may be better to label your old data with the cluster ids, then train a classifier, and rather use your classifier for new points than trying to update the dendrogram tree.
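A minimal sketch of that "label, then classify" idea is shown below. The choice of a nearest-centroid classifier, the function names and the toy data are all assumptions made for illustration; any classifier trained on the cluster ids would serve the same purpose.

```python
from collections import defaultdict
from math import dist


def fit_centroids(points, labels):
    """Compute one centroid per cluster id from already-clustered data."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    return {
        lab: tuple(sum(coord) / len(members) for coord in zip(*members))
        for lab, members in groups.items()
    }


def predict(centroids, point):
    """Assign a new point to the cluster with the nearest centroid."""
    return min(centroids, key=lambda lab: dist(centroids[lab], point))


# Old data, labelled with cluster ids obtained by cutting the dendrogram.
old_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
cluster_ids = [1, 1, 2, 2]

centroids = fit_centroids(old_points, cluster_ids)
for new_point in [(0.2, 0.4), (4.5, 0.9)]:
    print(new_point, "->", predict(centroids, new_point))
```

New points get a cluster id in time proportional to the number of clusters, and the dendrogram itself is never touched. The trade-off is that the hierarchy is frozen: a new point cannot create a new cluster or change existing merges.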
