Solved – (hierarchical) cluster analysis with non-standard distance

clustering, distance, distance-functions

My question is triggered by a question asked on Stack Overflow: https://stackoverflow.com/questions/12198115/using-different-metric-for-hclust-linkage.

The thing is this:

I can formulate an algorithm for hierarchical clustering with some kind of distance between the original objects, and I'll also need a distance between objects and clusters of objects (the linkage).

Let's say the linkage attaches a "handle" to the cluster, i.e. a (hypothetical) object that can be used with the distance function to calculate the distance to another object (or to another cluster, if it has such a "handle").

For consistency, I'd use the same distance function for the initial calculation of distance between single objects and later for distance between objects or "cluster handles".

So far, so good.

What I'm wondering about is: what happens if the clustering algorithm is fed a distance matrix instead of the data matrix (as is the case, e.g., with R's hclust)?

  • How do I know which distance function applied to the distance matrix agrees with which distance function on the data matrix? Examples?
  • The more basic maths/statistics question underlying this: please explain to me (a chemist, not a mathematician/statistician) what, e.g., the Euclidean distance between rows of a distance matrix means.

Best Answer

A somewhat complicated answer to your first question about distances goes like this.

Rencher's Multivariate Analysis cites Lance and Williams (1967), who proposed a general (flexible beta) scheme underlying most existing hierarchical cluster analyses: for three objects A, B and C (each a cluster, or an individual point regarded as a cluster of size 1), if clusters A and B are joined into AB, then the new distance from AB to C can be expressed as $$ d(AB,C) = \alpha_A d(C,A) + \alpha_B d(C,B) + \beta\, d(A,B) + \gamma\,| d(C,A) - d(C,B) |. $$ They suggested $\alpha_A + \alpha_B + \beta = 1$, $\alpha_A = \alpha_B$, $\gamma = 0$ and $\beta < 1$, although you could try other choices. E.g., single linkage is $\alpha_A = \alpha_B = 1/2$, $\beta=0$, $\gamma=-1/2$ (contracting the space); complete linkage is the same except for $\gamma=1/2$ (diluting the space); and Ward's method is $\alpha_A = (n_A + n_C)/(n_A + n_B + n_C)$, $\alpha_B = (n_B + n_C)/(n_A + n_B + n_C)$, $\beta= - n_C/(n_A + n_B + n_C)$, and $\gamma=0$ (somewhat space-contracting). The book discusses the properties of the algorithms, such as monotonicity and space contraction/dilution, as functions of these parameters.
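To make the recurrence concrete, here is a short Python sketch of the update formula with the parameter choices listed above (function and variable names are my own, not from Rencher or Lance & Williams):

```python
def lance_williams(d_CA, d_CB, d_AB, alpha_A, alpha_B, beta, gamma):
    """New distance d(AB, C) after clusters A and B are merged into AB."""
    return (alpha_A * d_CA + alpha_B * d_CB
            + beta * d_AB + gamma * abs(d_CA - d_CB))

def single_linkage(d_CA, d_CB, d_AB):
    # alpha_A = alpha_B = 1/2, beta = 0, gamma = -1/2  ->  min(d_CA, d_CB)
    return lance_williams(d_CA, d_CB, d_AB, 0.5, 0.5, 0.0, -0.5)

def complete_linkage(d_CA, d_CB, d_AB):
    # same, except gamma = +1/2  ->  max(d_CA, d_CB)
    return lance_williams(d_CA, d_CB, d_AB, 0.5, 0.5, 0.0, 0.5)

def ward(d_CA, d_CB, d_AB, n_A, n_B, n_C):
    # n_A, n_B, n_C are cluster sizes; for Ward's method the distances
    # are conventionally squared Euclidean distances
    n = n_A + n_B + n_C
    return lance_williams(d_CA, d_CB, d_AB,
                          (n_A + n_C) / n, (n_B + n_C) / n, -n_C / n, 0.0)

print(single_linkage(2.0, 5.0, 3.0))    # 2.0, i.e. the minimum
print(complete_linkage(2.0, 5.0, 3.0))  # 5.0, i.e. the maximum
```

Note that the single- and complete-linkage parameter choices recover the familiar min and max rules algebraically, which is a nice sanity check on the recurrence.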

So an algorithm can be built just based on the distances, and does not need the data matrix. However, for typical problems (sample size $n$ > dimension of the data $p$) the $n \times n$ distance matrix takes more memory than the $n \times p$ data matrix, so this may not be a very efficient way of approaching the problem, unless of course the distance matrix is all you have.
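To illustrate that point, here is a deliberately naive single-linkage sketch (my own toy code, $O(n^3)$, not how hclust is actually implemented) that runs entirely on a distance matrix, merging the closest pair and updating distances via the Lance-Williams recurrence:

```python
def hierarchical_cluster(D, alpha_A=0.5, alpha_B=0.5, beta=0.0, gamma=-0.5):
    """Naive agglomerative clustering driven only by a distance matrix D.
    Default parameters give single linkage. Returns the merge history as
    (cluster_a, cluster_b, merge_distance) triples; merged clusters get
    fresh labels n, n+1, ... as in hclust-style outputs."""
    labels = list(range(len(D)))
    dist = {(i, j): D[i][j] for i in labels for j in labels if i < j}
    merges = []
    while len(labels) > 1:
        # Find the closest pair of current clusters
        a, b = min(dist, key=dist.get)
        merges.append((a, b, dist[(a, b)]))
        new = max(labels) + 1
        # Lance-Williams update for every remaining cluster c
        for c in labels:
            if c in (a, b):
                continue
            d_ca = dist[tuple(sorted((c, a)))]
            d_cb = dist[tuple(sorted((c, b)))]
            d_ab = dist[(a, b)]
            dist[(c, new)] = (alpha_A * d_ca + alpha_B * d_cb
                              + beta * d_ab + gamma * abs(d_ca - d_cb))
        # Retire a and b, keep only distances not involving them
        labels = [c for c in labels if c not in (a, b)] + [new]
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
    return merges

# Toy distance matrix for three objects:
D = [[0, 3, 5], [3, 0, 5], [5, 5, 0]]
print(hierarchical_cluster(D))  # merges 0 and 1 first (distance 3)
```

Note that the data matrix never appears: everything after the initial distance computation is expressible in the distances alone, which is why hclust can accept a dist object as input.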