Solved – Purpose of dendrogram and hierarchical clustering

clusteringdendrogramhierarchical clustering

This is likely a very naive question. I've lately been reading about hierarchical clustering algorithms, and various discussions about how to interpret dendrograms or find optimal heights for cutting a dendrogram. I've also played around with some hierarchical clustering software in Python.

My question is: what exactly can a dendrogram tell you that you couldn't find out from a simple distance matrix?

Firstly, computing a dendrogram is very costly. Most linkage methods (such as group average linkage) have an algorithmic complexity of O(N^2) logN, which any comp-sci student can tell you, is pretty horrible in terms of scalability. Even a dataset with something like only 1,000 datapoints could be relatively slow on modern computing hardware.

Secondly, it seems that once you actually have the dendrogram, it's not always clear exactly how to cut it. However, generally you want to try and find clusters with lower distance values. Intuitively, clusters at heights with very high distance values are likely not very significant.

So with all that said, what exactly can a dendrogram tell you that you couldn't ascertain with a simple distance matrix? By that I mean, instead of constructing a dendrogram, simply compute a NxN distance matrix in O(N^2) time, then iterate over the matrix and find distance values which fall within some predefined range or threshold. This will give you a list of all points which are "close" to each other, and produce a single cluster of likely-related objects.

So, my question is, what can a dendrogram tell you that may potentially be more informative, significant or useful than a simple distance matrix?

Best Answer

What threshold would you use on the matrix approach?

That is what (for single-linkage, the other linkages are much more interesting) the height is: a distance threshold.

The hierarchical clustering algorithms are working on the distance matrix just as you suggest, and looking exactly for such distance patterns. But the matrix has size O(n^2) (which is pretty horrible, and most tools will run out of memory at around 50000 to 65535 objects) and if you only need to pass over the matrix log n times, then you have O(n^2 log n) complexity.

Now if you don't know the distance threshold yet - that is where the dendrogram gets nice. A dendrogram only has O(n) values. It's a condensed visual representation of the distances to help you choose the threshold (and there are also approaches that use multiple thresholds, or different thresholds in different parts of the data set!) You certainly won't want to present the O(n^2) distance matrix to the user and ask him to pick the threshold based on that?!

P.S. Single-linkage can be done in O(n^2), with less memory than a distance matrix. I believe it can even be done in O(n log n) under certain prerequisites.

Related Question