Solved – Comparing hierarchical clustering dendrograms obtained by different distances & methods

clustering, dendrogram, distance-functions, r, similarities

[The initial title "Measurement of similarity for hierarchical clustering trees" was later changed by @ttnphns to better reflect the topic]

I am performing a number of hierarchical cluster analyses on a dataframe of patient records (e.g. similar to http://www.biomedcentral.com/1471-2105/5/126/figure/F1?highres=y).

I am experimenting with different distance measures, different parameter weights and different hierarchical methods, to understand their impact on the final clusters and the structure/view of the tree (dendrogram). My question is whether there is a standard calculation / measure to quantify the difference between different hierarchical trees, and how to implement this in R (e.g. to quantify that some trees are nearly identical, and that some are drastically different).

Best Answer

To compare the similarity of two hierarchical (tree-like) structures, measures based on the cophenetic correlation idea are used. But is it correct to compare dendrograms at all in order to select the "right" method or distance measure in hierarchical clustering?

There are some points - hidden snags - regarding hierarchical cluster analysis that I consider quite important:

  • Never compare (in order to select the method giving the stronger partition) dendrograms obtained by different agglomeration methods visually. It won't tell you which method is "better" at that. Each method has its own "prototypical" tree look: the trees will differ consistently even when the data have no cluster structure or have a random cluster structure. (And I don't think there exists a standardization or measure that would remove these intrinsic differences.) You may, however, compare the looks of dendrograms produced by the same method on different data. Maxim: direct comparison of the appearance of dendrograms obtained by different methods is unacceptable.
  • Do not decide on the number of clusters (i.e. where to cut the tree) by looking at the dendrogram of the Ward method. In Ward, the tree shows the growth of the summative, not the averaged, coefficient of colligation; the consequence is that, since later clusters are bigger in the number of points, the later clusters look misleadingly "better" on the tree. To standardize Ward's dendrogram appropriately, divide the coefficient growth at each step by the overall number of points in the two clusters being combined (such a standardized Ward dendrogram, though, may be hard to render graphically; see the sketch just after this list).$^1$ Maxim: choosing a cut level by contemplating a dendrogram's appearance, while possible, is not the best way to select the partition, and for some methods it may be misleading. It is recommended to rely on some formal internal clustering criterion instead.
  • Albeit no one can forbid you from "experimenting" with distance measures or agglomerative methods, it is better to select the distance and the method consciously rather than by blind trying. The distance should reflect the aspects of difference you are interested in, and the method - one must be aware - implies a specific archetype of a cluster (e.g. the metaphor of a Ward cluster is, I would say, type; a cluster after complete linkage would be circle [by hobby or plot]; a cluster after single linkage would be spectrum [chain]; a cluster after the centroid method would be proximity of platforms [politics]; an average linkage cluster is conceptually the most undifferentiated and would be, generally, a united class).
  • Some methods call for the right distance measures and/or the right type of data. Ward and centroid, for example, logically require (squared) Euclidean distance, because these methods engage in the computation of centroids in Euclidean space. And the computation of geometric centroids is incongruous with, for example, binary data; the data should be scale/continuous. Maxim: the correspondence among data, distance and method assumptions is a very important and not-so-easy question.
  • Preprocessing (such as centering, scaling and other transformations of variables/features) prior to computing a distance matrix and doing the clustering is an extremely important question, too. It can dramatically influence the results. Think over what preprocessing may help you and will make sense from the interpretation point of view (a small illustration follows this list). Also, never be shy to inspect your data carefully and graphically before attempting a cluster analysis.
  • Not all methods of agglomerative clustering can equally be seen as giving you a hierarchical classification... on philosophical grounds. For example, the centroid method does give a hierarchy in a sense, because the cluster centre is an emergent and defining feature of a cluster as a whole, and the merging of clusters is driven by that feature. Complete linkage, on the other hand, "dismisses" both subclusters when it merges them - by virtue of the distances among the individual objects of the two. Thus, a complete linkage dendrogram is merely a history of merging and not a parent-child sort of taxonomy. Maxim: hierarchical agglomerative cluster analysis generally expects that you make a partition based on its result, rather than view the result as a hierarchical taxonomy.
  • Hierarchical clustering is a typical greedy algorithm: it makes the best choice among the alternatives appearing at each step, in the hope of getting close to the optimal solution in the end. However, the "best" choice appearing at a late step is likely to be poorer than the global optimum theoretically possible at that step. The later the step, the greater the suboptimality, as a rule. Given that we usually want few clusters, the last steps are the important ones; and, as just said, they are expected to be relatively poor when the number of steps is large (say, at the thousandth step). That is why hierarchical clustering is generally not recommended for large samples (thousands of objects), even if the program could handle such a big distance matrix.
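A minimal sketch of the Ward standardization mentioned in the second bullet, in base R (the built-in USArrests data are just a stand-in; the sketch assumes, as holds for R's hclust, that `$height` stores the per-step coefficient rather than its cumulative sum):

```r
# Ward clustering; heights here are the per-merge colligation coefficients
hc <- hclust(dist(scale(USArrests)), method = "ward.D2")

# Size of the cluster formed at each merge step, derived from $merge
# (negative entries denote singleton objects, positive ones earlier merges)
n_merges <- nrow(hc$merge)
sizes <- numeric(n_merges)
size_of <- function(i) if (i < 0) 1 else sizes[i]
for (k in seq_len(n_merges)) {
  sizes[k] <- size_of(hc$merge[k, 1]) + size_of(hc$merge[k, 2])
}

# Averaged, rather than summative, coefficient at each step
hc_std <- hc
hc_std$height <- hc$height / sizes
# plot(hc_std) may draw crossing branches: the standardized heights need
# not be monotone, which is exactly why such a dendrogram is hard to render.
```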

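And a tiny illustration of the preprocessing point above (again only a sketch on USArrests, whose variables' variances differ by orders of magnitude, so scaling can change the picture):

```r
x <- USArrests
apply(x, 2, var)  # Assault's variance dwarfs the other variables'

hc_raw    <- hclust(dist(x),        method = "average")  # Assault dominates
hc_scaled <- hclust(dist(scale(x)), method = "average")  # unit variances

# Cross-tabulate the two 4-cluster partitions: they can disagree markedly
table(raw = cutree(hc_raw, 4), scaled = cutree(hc_scaled, 4))
```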
If after the above precautions you still want a measure of similarity between hierarchical classifications, you might google 'comparing dendrograms' and 'comparing hierarchical classifications'. One idea that most readily suggests itself is based on the cophenetic correlation: having two dendrograms for the same dataset of $n$ objects, let $X_{ij}$ be the coefficient of colligation (or perhaps its rank, the step number) between every pair of objects $ij$ in one dendrogram, and let $Y_{ij}$ likewise be the same in the other dendrogram. Compute the correlation or the cosine between them.
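In R this can be sketched with base functions alone (a sketch, not a definitive recipe; the built-in USArrests data stand in for your patient records):

```r
d <- dist(scale(USArrests))  # one and the same data for both trees
hc1 <- hclust(d, method = "average")
hc2 <- hclust(d, method = "complete")

# X_ij and Y_ij: the cophenetic (colligation) coefficient for every pair
x <- as.vector(cophenetic(hc1))
y <- as.vector(cophenetic(hc2))

cor(x, y)                               # Pearson correlation
cor(x, y, method = "spearman")          # the rank variant mentioned above
sum(x * y) / sqrt(sum(x^2) * sum(y^2))  # or the cosine
```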


$^1$ Later update on the problem of the dendrogram of Ward's method. Different clustering programs may output differently transformed agglomeration coefficients for Ward's method. Hence their dendrograms will look somewhat different even though the clustering history and results are the same. For example, SPSS does not take the root of the ultrametric coefficients, and it cumulates them in the output. Another tradition (found in some R packages, for example) is to take the root (so-called "Ward-2" implementations) and not to cumulate. To repeat: such differences affect only the general shape/look of the dendrogram, not the clustering results. But the look of the dendrogram might influence your decision about the number of clusters. The moral is that it would be safest not to rely on the dendrogram in Ward's method at all, unless you know exactly what these coefficients are in your program and how to interpret them correctly.
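In R's own hclust the two traditions sit side by side as method = "ward.D" (to be fed squared Euclidean distances) and method = "ward.D2" (plain distances, with the root taken internally). A small sketch, assuming no ties in the data, of how only the heights, not the merge history, differ:

```r
d <- dist(scale(USArrests))

hc1 <- hclust(d^2, method = "ward.D")   # classic Ward on squared distances
hc2 <- hclust(d,   method = "ward.D2")  # "Ward-2": square root taken inside

identical(hc1$merge, hc2$merge)          # TRUE: same clustering history
all.equal(sqrt(hc1$height), hc2$height)  # heights differ only by the root
```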