Let's look at this example:
simple.data = data.frame(
x = c(0,1,0.5),
y = c(0,0,0.9)
)
par(mfrow = c(1,2))
plot(simple.data, xlab = "Dimension 1", ylab = "Dimension 2")
text(simple.data[1,], labels = 1, pos = 3)
text(simple.data[2,], labels = 2, pos = 3)
text(simple.data[3,], labels = 3, pos = 1)
eucl = dist(simple.data, method = "euclidean")
# 1 2
# 2 1.000000
# 3 1.029563 1.029563
agglo = hclust(eucl, method = "centroid")
cophenetic(agglo)
# 1 2
# 2 1.000000
# 3 0.779563 0.779563
par(mar = c(2,4,1,1))
plot(as.dendrogram(agglo), main = "Dendrogram", ylab = "Height", ylim = c(0,1))
-
I know that the distances between clusters can be computed using the Lance und Williams Formula, e.g. $$D(A \cup B,C)=\alpha_1 d(A,C)+\alpha_2 d(B,C)+ \beta d(A,B) + \gamma |d(A,C)-d(B,C)|.$$
with $\alpha_1 = \tfrac{|A|}{|A|+|B|}, \alpha_2 = \tfrac{|B|}{|A|+|B|}, \beta = \tfrac{|A||B|}{(|A|+|B|)^2}, \gamma = 0$ for centroid linkage (see also https://de.wikipedia.org/wiki/Hierarchische_Clusteranalyse#Lance_und_Williams_Formel). -
I also know that, in R, the dendrogram and the function
cophenetic()
computes the distances between two clusters with this formula, e.g. after merging the two closest points (in example above: point 1 and point 2), the distance between the cluster that consists of point 1 and 2 and the second cluster that only consists of point 3 is, according to the Lance und Williams Formula with $\alpha_1 = \alpha_2 = 0.5, \gamma = 0$ and $\beta = 0.25$:0.5*1.029563 + 0.5*1.029563 - 0.25*1 = 0.779563
. Therefore, the dendrogram shows a merge of those two clusters at "Height"0.779563
. -
However, since "in centroid method, the distance between two clusters is the distance between the two cluster centroids", I would have computed this distance differently, namely:
- compute centroid of point 1 and 2, which is at (0.5, 0)
- "centroid" of point 3 is point 3 itself (located at (0.5, 0.9), see plot).
- euclidean distance between the two cluster centroids is therefore
sqrt((0.5-0.5)^2+(0-0.9)^2) = 0.9
, which is not the same distance as computed with the Lance und Williams Formula.
So, my questions is: Why do we use the Lance und Williams Formula (e.g. the 0.779563
) to plot the dendrogram and do not use "the distance between the two cluster centroids", which is 0.9
?
Best Answer
According to the book Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze:
this seems this is an typical behavior.
Clearly, 1 and 2 are the closest points at distance 1. So that is the best possible merge here.
The squared Euclidean distance of the centroids is $0^2+0.9^2=0.81$.
With Lance-Williams: The squared Euclidean distance is $d(1,3)=d(2,3)=0.9^2+0.5^2 = 1.06$, while d(1,2)=1. So we get $$\frac121.06+\frac121.06-\frac141=0.81.$$
Therefore:
But don't ask me for a proof that the Lance-Williams and "direct" definition are equivalent. Apparently the proof can only work for squared Euclidean? I guess it is similar to the derivation of Ward's, which appears to be the "properly weighted" version of centroid linkage.