Solved – Hierarchical clustering: different result when I change labels

hierarchical-clustering, r

I am running hierarchical clustering with a distance matrix M_norm:

# hclust() expects a "dist" object, so convert the squared matrix first
hc <- hclust(as.dist(M_norm^2), method="ward.D")
plot(hc, cex=1, hang=-1)

When I use different rownames and colnames in M_norm, the resulting dendrogram changes a little bit: heights where certain branches are joined are not the same as before. The height of the final join is also different.

The order of rows and columns in the input matrix is now different, but the distances between units are the same. I understand that the order of units at the bottom of the picture can change, but how can the join heights change? Is the implementation of this algorithm not deterministic?

Best Answer

I think I have found an answer here: http://r.789695.n4.nabble.com/hclust-does-order-of-data-matter-td3043896.html

Generally in hierarchical clustering the result can be ambiguous if there are several distances of identical value in the dataset (or identical between-cluster distances occur when aggregating clusters). The role of the order of the data depends on how these ambiguities are resolved.
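To see how ties can matter, here is a minimal sketch (the toy data and the choice of average linkage are my own assumptions, not taken from the question). Four points on a line at 0, 1, 2, 3 have three tied nearest-neighbour distances. Working it out by hand, merging the two outer pairs first gives join heights 1, 1, 2, while merging the middle pair first gives 1, 1.5, 2. Which path hclust takes for a given row order depends on its internal tie-breaking, so the code prints the heights rather than assuming them.

```r
# Four points on a line; neighbouring points are all exactly 1 apart (ties).
x <- c(a = 0, b = 1, c = 2, d = 3)

# The same points in a different row order; pairwise distances are unchanged.
x_perm <- x[c("b", "c", "a", "d")]

# Confirm that the distance set really contains duplicated values.
has_ties <- any(duplicated(as.vector(dist(x))))
print(has_ties)

hc1 <- hclust(dist(x), method = "average")
hc2 <- hclust(dist(x_perm), method = "average")

# Join heights for each ordering; with ties these are allowed to differ.
print(hc1$height)
print(hc2$height)
```

If the two printed height vectors differ, the tied distances are the cause, not any randomness in the algorithm: for a fixed input order the result is reproducible.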

Related Solutions

Solved – Choosing the number of clusters in hierarchical agglomerative clustering

See, even hierarchical clustering needs parameters if you want to get a partitioning out. In fact, hierarchical clustering has (roughly) four parameters: 1. the actual algorithm (divisive vs. agglomerative), 2. the distance function, 3. the linkage criterion (single-link, Ward, etc.) and 4. the distance threshold at which you cut the tree (or any other extraction method).
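As a rough illustration of where those four choices show up in R, here is a sketch on made-up data (the random points, the Euclidean distance, average linkage, and the cut height of 1.5 are all arbitrary assumptions):

```r
set.seed(1)
pts <- matrix(rnorm(40), ncol = 2)        # 20 toy observations in 2-D

d  <- dist(pts, method = "euclidean")     # 2. the distance function
hc <- hclust(d, method = "average")       # 1. agglomerative algorithm, 3. linkage criterion

by_height <- cutree(hc, h = 1.5)          # 4. cut the tree at a distance threshold ...
by_count  <- cutree(hc, k = 3)            # ... or extract a fixed number of clusters instead

print(table(by_height))
print(table(by_count))
```

Changing any one of these choices can change the partition, which is why none of them can be treated as a default.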

Fact is that there doesn't exist any good "push button" solution to cluster analysis. It is an explorative technique, meaning that you have to try different methods and parameters and analyze the result.

I found DBSCAN to be very usable in most cases. Yes, it has two parameters (distance threshold aka: neighbor predicate, and minpts aka core predicate) - I'm not counting the distance function separately this time, because it's really a "is neighbor of" binary predicate that is needed; see GDBSCAN.

The reason is that in many applications you can choose these values intuitively if you have understood your data well enough. E.g. when working with geo data, distance is literally in kilometers, and it allows me to intuitively specify the spatial resolution. Similarly, minpts gives me an intuitive control over how "significant" a subset of observations needs to be before it becomes a cluster.

Usually, when you find DBSCAN hard to use, it is because you have not understood "distance" on your data yet. You then first need to figure out how to measure distance and what the resulting numbers mean to you. Then you'll know the threshold to use.
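To make the roles of the two parameters concrete, here is a from-scratch sketch of the DBSCAN idea in base R. It is a simplified illustration, not an optimized implementation; the function name simple_dbscan, the argument names eps and min_pts, and the toy coordinates are all hypothetical. A point is "core" if at least min_pts points (itself included) lie within eps of it, and clusters grow outward from core points.

```r
# Minimal DBSCAN sketch (hypothetical helper; uses a full O(n^2) distance matrix).
simple_dbscan <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  # Neighbourhood of each point: everything within eps (includes the point itself).
  neighbors <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbors) >= min_pts
  cluster <- integer(n)                  # 0 means noise / not yet assigned
  current <- 0L
  for (i in seq_len(n)) {
    if (cluster[i] != 0L || !is_core[i]) next
    current <- current + 1L              # start a new cluster at an unassigned core point
    cluster[i] <- current
    queue <- neighbors[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (cluster[j] != 0L) next         # already claimed by some cluster
      cluster[j] <- current
      # Only core points propagate the cluster further; border points do not.
      if (is_core[j]) queue <- c(queue, neighbors[[j]])
    }
  }
  cluster
}

# Two tight groups and one isolated point (toy coordinates).
x <- rbind(
  cbind(c(0, 0.1, 0.2, 0.1), c(0, 0.1, 0, -0.1)),
  cbind(c(5, 5.1, 5.2, 5.1), c(5, 5.1, 5, 4.9)),
  c(20, 20)
)
labels <- simple_dbscan(x, eps = 0.5, min_pts = 3)
print(labels)
```

Here eps is in the same units as the coordinates, so if the data were in kilometers, eps = 0.5 would read directly as "within 500 m", and min_pts = 3 says a group needs at least three mutually close observations before it counts as a cluster.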

And in the end go and try out stuff. It's data exploration, not "return(truth);". There is no "true" clustering. There are only "obvious", "useless" and "interesting" clusterings, and these qualities cannot be measured mathematically; they are subjective to the user.

Solved – Validate dendrogram in cluster analysis: What is the meaning of cophenetic correlation coefficient

The cophenetic correlation coefficient is defined as the linear correlation between the dissimilarities $d_{ij}$ between each pair of observations $(i,j)$ and their corresponding cophenetic distances $d_{ij}^{coph}$, where $d_{ij}^{coph}$ is the intergroup dissimilarity at which the observations $i$ and $j$ are first merged into the same cluster.

So you get the cophenetic correlation coefficient $CCC$ by calculating the correlation between those values. Let $D$ be the distance matrix according to $d$ and $Z$ the distance matrix according to $d^{coph}$, and let $\bar{D}$ and $\bar{Z}$ denote the means of $d_{ij}$ and $d_{ij}^{coph}$ respectively. Then

$CCC(D,Z) = Cor(D,Z) = \frac{\sum\limits_{i<j} (D_{ij} - \bar{D})(Z_{ij} - \bar{Z}) }{\sqrt{\sum\limits_{i<j} (D_{ij} - \bar{D})^2 \sum\limits_{i<j} (Z_{ij} - \bar{Z})^2 }}$

(see: Mathworks Documentation: cophenetic correlation coefficient)

This should be equal to what you have done by calculating

cor(euclidian_dist, coph)

So, I think your assumption is correct.
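For reference, a small sketch of that calculation in R, using the built-in USArrests data and average linkage purely as stand-ins (your own distance object and linkage method would go in their place). It computes the coefficient with cor() and again directly from the formula, summing over pairs $i<j$, to show that the two agree.

```r
d    <- dist(USArrests)                  # original dissimilarities d_ij
hc   <- hclust(d, method = "average")
coph <- cophenetic(hc)                   # cophenetic distances d_ij^coph, as a "dist" object

# A "dist" object stores each pair i < j once, so as.vector() gives one value per pair.
D <- as.vector(d)
Z <- as.vector(coph)

ccc <- cor(D, Z)

# The same quantity written out term by term from the definition.
ccc_manual <- sum((D - mean(D)) * (Z - mean(Z))) /
  sqrt(sum((D - mean(D))^2) * sum((Z - mean(Z))^2))

print(c(ccc, ccc_manual))
```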
