Solved – Centroid linkage clustering with hclust yields wrong dendrogram

clusteringdendrogramdistancehierarchical clusteringr

Let's look at this example:

simple.data = data.frame(
  x = c(0,1,0.5),
  y = c(0,0,0.9)
)

par(mfrow = c(1,2))
plot(simple.data, xlab = "Dimension 1", ylab = "Dimension 2")
text(simple.data[1,], labels = 1, pos = 3)
text(simple.data[2,], labels = 2, pos = 3)
text(simple.data[3,], labels = 3, pos = 1)

eucl = dist(simple.data, method = "euclidean")

#         1        2
# 2 1.000000         
# 3 1.029563 1.029563

agglo = hclust(eucl, method = "centroid")
cophenetic(agglo)

#          1        2
# 2 1.000000         
# 3 0.779563 0.779563

par(mar = c(2,4,1,1))
plot(as.dendrogram(agglo), main = "Dendrogram", ylab = "Height", ylim = c(0,1))

enter image description here

  1. I know that the distances between clusters can be computed using the Lance und Williams Formula, e.g. $$D(A \cup B,C)=\alpha_1 d(A,C)+\alpha_2 d(B,C)+ \beta d(A,B) + \gamma |d(A,C)-d(B,C)|.$$
    with $\alpha_1 = \tfrac{|A|}{|A|+|B|}, \alpha_2 = \tfrac{|B|}{|A|+|B|}, \beta = \tfrac{|A||B|}{(|A|+|B|)^2}, \gamma = 0$ for centroid linkage (see also https://de.wikipedia.org/wiki/Hierarchische_Clusteranalyse#Lance_und_Williams_Formel).

  2. I also know that, in R, the dendrogram and the function cophenetic() computes the distances between two clusters with this formula, e.g. after merging the two closest points (in example above: point 1 and point 2), the distance between the cluster that consists of point 1 and 2 and the second cluster that only consists of point 3 is, according to the Lance und Williams Formula with $\alpha_1 = \alpha_2 = 0.5, \gamma = 0$ and $\beta = 0.25$: 0.5*1.029563 + 0.5*1.029563 - 0.25*1 = 0.779563. Therefore, the dendrogram shows a merge of those two clusters at "Height" 0.779563.

  3. However, since "in centroid method, the distance between two clusters is the distance between the two cluster centroids", I would have computed this distance differently, namely:

    • compute centroid of point 1 and 2, which is at (0.5, 0)
    • "centroid" of point 3 is point 3 itself (located at (0.5, 0.9), see plot).
    • euclidean distance between the two cluster centroids is therefore sqrt((0.5-0.5)^2+(0-0.9)^2) = 0.9, which is not the same distance as computed with the Lance und Williams Formula.

So, my questions is: Why do we use the Lance und Williams Formula (e.g. the 0.779563) to plot the dendrogram and do not use "the distance between the two cluster centroids", which is 0.9?

Best Answer

According to the book Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze:

centroid clustering is not monotonic. So-called inversions can occur: Similarity can increase during clustering as in the example in Figure 17.12, where we define similarity as negative distance.

this seems this is an typical behavior.

Clearly, 1 and 2 are the closest points at distance 1. So that is the best possible merge here.

The squared Euclidean distance of the centroids is $0^2+0.9^2=0.81$.

With Lance-Williams: The squared Euclidean distance is $d(1,3)=d(2,3)=0.9^2+0.5^2 = 1.06$, while d(1,2)=1. So we get $$\frac121.06+\frac121.06-\frac141=0.81.$$

Therefore:

  1. The result is correct - if you use squared Euclidean distances. I don't think you can use the (efficient!) Lance-Williams approach with other distances. Plus, the centroid makes most sense with squared errors, too.
  2. Inversions in centroid linkage do occur. Even if you would use non-squared Euclidean distance. With regular Euclidean distance, you would still merge 1,2 at distance 1, and then merge 1,2,3 at $\sqrt{0.81}=0.9$; which is less than 1.

But don't ask me for a proof that the Lance-Williams and "direct" definition are equivalent. Apparently the proof can only work for squared Euclidean? I guess it is similar to the derivation of Ward's, which appears to be the "properly weighted" version of centroid linkage.