Let us have square symmetric matrix of squared euclidean distances $\bf D$ between $n$ points and vector lengthed $n$ indicating cluster or group membership ($k$ clusters) of the points; a cluster may consist of $\ge1$ point.
What is the most efficient or really efficient (in terms of speed) way to compute distances between the cluster centroids here?
So far I always did Principal Coordinate analysis in this situation. PCoA, or Torgerson's MDS amounts to first converting $\bf D$ into the matrix of scalar products $\bf S$ ("double centering") and then performing PCA of it. This way we create coordinates for the $n$ points in the euclidean space they span. After that, it is easy to compute distances between the centroids the usual way – as you would do it with grouped points x variables
data. PCoA has to do eigen-decomposition or SVD of the n x n
symmetric positive semidefinite $\bf S$, but $n$ can be quite big. In addition, the task is not a dimensionality reduction one and we don't actually need those orthogonal principal axes. So I have a feeling that these decompositions might be an overkill.
So, do you have knowledge or ideas about a potentially faster way?
Best Answer
Let the points be indexed $x_1, x_2, \ldots, x_n$, all of them in $\mathbb{R}^d$. Let $\mathcal{I}$ be the indexes for one cluster and $\mathcal{J}$ the indexes for another cluster. The centroids are
$$c_\mathcal{I} = \frac{1}{|\mathcal{I}|} \sum_{i\in\mathcal{I}} x_i,\ c_\mathcal{J} = \frac{1}{|\mathcal{J}|} \sum_{j\in\mathcal{J}} x_j$$
and it is desired to find their squared distance $||c_\mathcal{I} - c_\mathcal{J}||^2$ in terms of the squared distances $D_{ij} = ||x_i - x_j||^2$.
Exactly as we would break down sums of squares in ANOVA calculations, an algebraic identity is
$$||c_\mathcal{I} - c_\mathcal{J}||^2 = \frac{1}{|\mathcal{I}||\mathcal{J}|} \left(SS(\mathcal{I \cup J}) -\left(|\mathcal{I}|+|\mathcal{J}|\right) \left(\frac{1}{|\mathcal{I}|}SS(\mathcal{I}) + \frac{1}{|\mathcal{J}|}SS(\mathcal{J})\right)\right)$$
where "$SS$" refers to the sum of squares of distances between each point in a set and their centroid. The polarization identity re-expresses this in terms of squared distances between all points:
$$SS(\mathcal{K}) = \frac{1}{2}\sum_{i,j\,\in\,\mathcal{K}} ||x_i - x_j||^2 = \sum_{i\lt j\,\in\,\mathcal{K}} D_{ij}.$$
The computational effort therefore is $O((|\mathcal{I}|+|\mathcal{J}|)^2)$, with a very small implicit constant. When the clusters are approximately the same size and there are $k$ of them, this is $O(n^2/k^2)$, which is directly proportional to the number of entries in $D$: that would be the best one could hope for.
R
code to illustrate and test these calculations follows.