Clustering Mean Euclidean – Efficient Methods to Compute Distances Between Centroids from Distance Matrix

clustering, distance, euclidean, mean

Suppose we have a square symmetric matrix $\bf D$ of squared Euclidean distances between $n$ points, and a vector of length $n$ indicating the cluster or group membership ($k$ clusters) of the points; a cluster may consist of $\ge1$ point.

What is a really efficient (in terms of speed) way to compute the distances between the cluster centroids in this situation?

So far I have always done Principal Coordinate Analysis in this situation. PCoA, or Torgerson's MDS, amounts to first converting $\bf D$ into the matrix of scalar products $\bf S$ ("double centering") and then performing PCA of it. This way we create coordinates for the $n$ points in the Euclidean space they span. After that it is easy to compute distances between the centroids in the usual way, as you would with grouped points-by-variables data. But PCoA has to perform an eigendecomposition or SVD of the $n \times n$ symmetric positive semidefinite $\bf S$, and $n$ can be quite big. Moreover, the task is not one of dimensionality reduction, and we don't actually need those orthogonal principal axes. So I have a feeling that these decompositions might be overkill.
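For concreteness, here is a minimal sketch of this baseline in R (the function name is mine; it assumes $\bf D$ is supplied as the matrix `D` of squared distances and `g` is the length-$n$ membership vector):

# PCoA baseline (a sketch): recover point coordinates from D, then compute
# centroid distances the usual way. Assumes `D` is the n x n matrix of
# *squared* Euclidean distances and `g` a length-n membership vector.
pcoa.centroid.dist2 <- function(D, g) {
  n <- nrow(D)
  J <- diag(n) - matrix(1/n, n, n)        # centering matrix
  S <- -0.5 * J %*% D %*% J               # double centering: scalar products
  e <- eigen(S, symmetric = TRUE)
  pos <- e$values > 1e-9 * max(e$values)  # keep the positive eigenvalues
  X <- e$vectors[, pos, drop = FALSE] %*% diag(sqrt(e$values[pos]), sum(pos))
  C <- apply(X, 2, function(ax) tapply(ax, g, mean))  # centroids, one per row
  as.matrix(dist(C))^2                    # squared distances between centroids
}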

So, do you know of, or have ideas about, a potentially faster way?

Best Answer

Let the points be indexed $x_1, x_2, \ldots, x_n$, all of them in $\mathbb{R}^d$. Let $\mathcal{I}$ be the indexes for one cluster and $\mathcal{J}$ the indexes for another cluster. The centroids are

$$c_\mathcal{I} = \frac{1}{|\mathcal{I}|} \sum_{i\in\mathcal{I}} x_i,\ c_\mathcal{J} = \frac{1}{|\mathcal{J}|} \sum_{j\in\mathcal{J}} x_j$$

and it is desired to find their squared distance $||c_\mathcal{I} - c_\mathcal{J}||^2$ in terms of the squared distances $D_{ij} = ||x_i - x_j||^2$.

Exactly as we would break down sums of squares in ANOVA calculations, an algebraic identity is

$$||c_\mathcal{I} - c_\mathcal{J}||^2 = \frac{1}{|\mathcal{I}||\mathcal{J}|} \left(SS(\mathcal{I \cup J}) -\left(|\mathcal{I}|+|\mathcal{J}|\right) \left(\frac{1}{|\mathcal{I}|}SS(\mathcal{I}) + \frac{1}{|\mathcal{J}|}SS(\mathcal{J})\right)\right)$$

where "$SS$" refers to the sum of squares of distances between each point in a set and their centroid. The polarization identity re-expresses this in terms of squared distances between all points:

$$SS(\mathcal{K}) = \frac{1}{2}\sum_{i,j\,\in\,\mathcal{K}} ||x_i - x_j||^2 = \sum_{i\lt j\,\in\,\mathcal{K}} D_{ij}.$$
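To see why the main identity holds, expand the cross-cluster sum of squared distances around the two centroids (the cross terms vanish because deviations from a centroid sum to zero):

$$\sum_{i\in\mathcal{I}}\sum_{j\in\mathcal{J}} D_{ij} = \frac{|\mathcal{J}|}{|\mathcal{I}|}SS(\mathcal{I}) + \frac{|\mathcal{I}|}{|\mathcal{J}|}SS(\mathcal{J}) + |\mathcal{I}||\mathcal{J}|\,||c_\mathcal{I} - c_\mathcal{J}||^2.$$

Since every pair of points in $\mathcal{I}\cup\mathcal{J}$ either lies within one cluster or straddles the two, $SS(\mathcal{I\cup J}) = SS(\mathcal{I}) + SS(\mathcal{J}) + \sum_{i\in\mathcal{I}}\sum_{j\in\mathcal{J}} D_{ij}$; substituting and collecting terms gives the displayed formula.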

The computational effort for one pair of clusters therefore is $O((|\mathcal{I}|+|\mathcal{J}|)^2)$, with a very small implicit constant. When the clusters are of approximately equal size and there are $k$ of them, each pair costs $O(n^2/k^2)$, so all $k(k-1)/2$ pairs together cost $O(n^2)$: directly proportional to the number of entries in $\bf D$, which is the best one could hope for.


R code to illustrate and test these calculations follows.

ss <- function(x) {
  # Points are the *columns* of `x`; returns SS(K), the sum of squared
  # distances over all pairs of points.
  n <- dim(x)[2]
  i <- rep(1:n, n)
  j <- as.vector(t(matrix(i, n)))
  d <- matrix(colSums((x[,i] - x[,j])^2), n) # The squared-distance matrix for `x`
  sum(d[lower.tri(d)])
}
centroid <- function(x) rowMeans(x)     # centroid of points stored as columns
distance2 <- function(x,y) sum((x-y)^2) # squared Euclidean distance
#
# Generate two clusters randomly.
#
n.x <- 3; n.y <- 2
x <- matrix(rnorm(2*n.x), 2)
y <- matrix(rnorm(2*n.y), 2)
#
# Compare two formulae.
#
cat("Squared distance between centroids =",
    distance2(centroid(x), centroid(y)),
    "Equivalent value =", 
    (ss(cbind(x,y)) - (n.x + n.y) * (ss(x)/n.x + ss(y)/n.y)) / (n.x*n.y),
    "\n")