Solved – Inversions in hierarchical clustering

clustering, r

I'm using heatmap.2 to cluster my data, using the centroid method for clustering and the maximum method for calculating the distance matrix:

library("gplots")
library("RColorBrewer")

test           <- matrix(c(0.96, 0.07, 0.97, 0.98, 
                           0.50, 0.28, 0.29, 0.77, 
                           0.08, 0.96, 0.51, 0.51, 
                           0.14, 0.19, 0.41, 0.51), ncol=4, byrow=TRUE)
colnames(test) <- c("Exp1","Exp2","Exp3","Exp4")
rownames(test) <- c("Gene1","Gene2","Gene3", "Gene4")
test           <- as.table(test)
mat            <- data.matrix(test)

heatmap.2(mat, dendrogram="row", Rowv=TRUE, Colv=FALSE, 
          distfun=function(x) dist(x, method='maximum'),
          hclustfun=function(x) hclust(x, method='centroid'),
          xlab=NULL, ylab=NULL, key=TRUE, keysize=1, trace="none", 
          density.info=c("none"), margins=c(6, 12), col=bluered)

This gives a heatmap with inversions in the cluster tree, which is inherent to the centroid method. A solution to avoid inversions is to use the Euclidean or the city-block distance, and indeed if you change maximum to Euclidean in the above example the inversions are gone (for reference see chapter 4.1.1 in this link).

Now to my problem: when I use my actual data instead of this example table, the inversions are still there after I change to Euclidean. The R code is exactly the same as in this example, only the data is different. When I use Cluster 3.0 and Java TreeView with the Euclidean distance and the centroid method, there are no inversions in my data, as expected. So why does R give inversions? The theory and the other software say it shouldn't.

Update: This is an example where changing maximum to Euclidean does not fix the inversions (as opposed to the above example, where it did fix them):

library("gplots")
library("RColorBrewer")

test           <- matrix(c(0.96, 0.07, 0.97, 0.98, 0.99, 0.50, 
                           0.28, 0.29, 0.77, 0.78, 0.08, 0.96, 
                           0.51, 0.51, 0.55, 0.14, 0.19, 0.41, 
                           0.51, 0.40, 0.97, 0.98, 0.99, 0.50, 
                           0.28                               ), ncol=6, byrow=TRUE)
colnames(test) <- c("Exp1", "Exp2", "Exp3", "Exp4", "Exp5", "Exp6")
rownames(test) <- c("Gene1", "Gene2", "Gene3", "Gene4")
test           <- as.table(test)
mat            <- data.matrix(test)

heatmap.2(mat, dendrogram="row", Rowv=TRUE, Colv=FALSE, 
          distfun=function(x) dist(x, method='maximum'),
          hclustfun=function(x) hclust(x, method='centroid'),
          xlab=NULL, ylab=NULL, key=TRUE, keysize=1, trace="none", 
          density.info=c("none"), margins=c(6, 12), col=bluered)

Best Answer

Changing the metric may sometimes help, but the problem, as you noticed, is caused by the linkage method (the centroid method in your case). So if this issue occurs, the appropriate solution is to choose a different method.
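
To see whether a given tree actually contains an inversion without inspecting the plot, one option is to look at the merge heights returned by hclust, which are stored in merge order. The sketch below reuses the 4 x 4 example matrix from the question; has_inversion is a hypothetical helper name, not part of any package, and it assumes that a decrease in the height component is what shows up as an inversion in the dendrogram.

```r
# The 4 x 4 example matrix from the question
mat <- matrix(c(0.96, 0.07, 0.97, 0.98,
                0.50, 0.28, 0.29, 0.77,
                0.08, 0.96, 0.51, 0.51,
                0.14, 0.19, 0.41, 0.51),
              ncol = 4, byrow = TRUE,
              dimnames = list(paste0("Gene", 1:4), paste0("Exp", 1:4)))

# Hypothetical helper: hc$height lists the merge heights in merge order,
# so an inversion exists when a later merge is lower than an earlier one
has_inversion <- function(hc) is.unsorted(hc$height)

# Same linkage method, two different distance measures
hc_max <- hclust(dist(mat, method = "maximum"),   method = "centroid")
hc_euc <- hclust(dist(mat, method = "euclidean"), method = "centroid")

print(hc_max$height)          # merge heights for the maximum distance
print(has_inversion(hc_max))  # the question reports inversions here
print(has_inversion(hc_euc))  # the question reports none here
```

The same check can be run on the real data set. If it returns TRUE for the centroid method, switching the distance measure is not guaranteed to help, because the centroid method itself is not monotonic.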

Inversions do not appear if you use a monotonic clustering method, i.e. one whose merge heights never decrease from one merge to the next.

The main monotonic hierarchical clustering methods are:

  • Single Linkage
  • Complete Linkage
  • Average Linkage
  • Weighted Average Linkage
  • Ward's Linkage

Choose whichever of these best suits the problem at hand.
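
As a rough illustration, the listed methods can be tried on the question's 4 x 4 example with the maximum distance, which is the combination that produced inversions under the centroid method. The mapping of the list above to hclust method names is my assumption ("mcquitty" for weighted average linkage, "ward.D2" for Ward's linkage); check ?hclust for the exact definitions before relying on it.

```r
# The 4 x 4 example matrix from the question
mat <- matrix(c(0.96, 0.07, 0.97, 0.98,
                0.50, 0.28, 0.29, 0.77,
                0.08, 0.96, 0.51, 0.51,
                0.14, 0.19, 0.41, 0.51),
              ncol = 4, byrow = TRUE,
              dimnames = list(paste0("Gene", 1:4), paste0("Exp", 1:4)))

d <- dist(mat, method = "maximum")

# Assumed correspondence between the methods listed above and hclust() names
monotone_methods <- c(single           = "single",
                      complete         = "complete",
                      average          = "average",
                      weighted_average = "mcquitty",
                      ward             = "ward.D2")

# For each method, check whether the merge heights ever decrease
inverted <- sapply(monotone_methods, function(m) {
  is.unsorted(hclust(d, method = m)$height)
})

print(inverted)  # one logical value per method
```

In the heatmap.2 call from the question, only the hclustfun argument would need to change, for example to function(x) hclust(x, method='average'); the distfun argument can stay as it is.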
