How is finding the centroid different from finding the mean?

clustering, mean, types-of-averages

When performing hierarchical clustering, one can use many metrics to measure the distance between clusters. Two such metrics involve calculating the centroids and the means of the data points in the clusters.

What is the difference between the mean and the centroid? Aren't these the same point in a cluster?

Best Answer

As far as I know, the "mean" of a cluster and the centroid of a single cluster are the same thing, though the term "centroid" might be a little more precise than "mean" when dealing with multivariate data.

To find the centroid, one computes the (arithmetic) mean of the points' positions separately for each dimension. For example, if you had points at:

  • (-1, 10, 3),
  • (0, 5, 2), and
  • (1, 20, 10),

then the centroid would be located at ((-1+0+1)/3, (10+5+20)/3, (3+2+10)/3), which simplifies to (0, 11 2/3, 5). (NB: The centroid does not have to be, and rarely is, one of the original data points.)
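The calculation above can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name `centroid` is my own choice, not something from a particular package.

```python
def centroid(points):
    """Return the per-dimension arithmetic mean of a list of equal-length points."""
    n = len(points)
    # zip(*points) groups the coordinates by dimension: all x's, all y's, all z's.
    return tuple(sum(coords) / n for coords in zip(*points))

points = [(-1, 10, 3), (0, 5, 2), (1, 20, 10)]
c = centroid(points)
print(c)  # roughly (0, 11.67, 5), matching the hand calculation
```

Note that the result is not one of the three input points.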

The centroid is also sometimes called the center of mass or barycenter, based on its physical interpretation (it's the center of mass of an object defined by the points, each given equal mass). Like the mean in one dimension, the centroid is the location that minimizes the sum of squared distances to the data points.
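To see why the minimizer is the centroid, write the points as $x_1, \dots, x_n$ (my notation, not from the question) and look for the location $c$ with the smallest sum of squared Euclidean distances:

$$
f(c) = \sum_{i=1}^{n} \lVert x_i - c \rVert^2,
\qquad
\nabla f(c) = -2 \sum_{i=1}^{n} (x_i - c) = 0
\;\Longrightarrow\;
c = \frac{1}{n} \sum_{i=1}^{n} x_i .
$$

Since $f$ is convex in $c$, this stationary point is the minimum, and it is exactly the per-dimension mean described above.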

A related idea is the medoid, which is the data point that is "least dissimilar" from all of the other data points. Unlike the centroid, the medoid has to be one of the original points. You may also be interested in the geometric median, which is analogous to the median, but for multivariate data. These are both different from the centroid.
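As a rough sketch of the medoid idea, the snippet below picks the data point whose total distance to all the points is smallest. It assumes Euclidean distance as the dissimilarity measure (other choices are possible), and the function name `medoid` is just illustrative.

```python
import math

def medoid(points):
    """Return the data point with the smallest total Euclidean distance to all points."""
    # math.dist requires Python 3.8 or later.
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

points = [(-1, 10, 3), (0, 5, 2), (1, 20, 10)]
m = medoid(points)
print(m)  # always one of the original points, unlike the centroid
```

For the three example points used earlier, this selects (-1, 10, 3), whereas the centroid (0, 11 2/3, 5) is not in the data set at all.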

However, as Gabe points out in his answer, there is a difference between the "centroid distance" and the "average distance" when you're comparing clusters. The centroid distance between clusters $A$ and $B$ is simply the distance between $\text{centroid}(A)$ and $\text{centroid}(B)$. The average distance is calculated by finding the average pairwise distance between points in one cluster and points in the other. In other words, for every point $a_i$ in cluster $A$, you calculate $\text{dist}(a_i, b_1)$, $\text{dist}(a_i, b_2)$, ..., $\text{dist}(a_i, b_n)$ for the $n$ points in cluster $B$, and then average all of these distances together.
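A small example may make the distinction concrete. The sketch below assumes Euclidean distance and uses two made-up 2-D clusters; the helper names (`centroid`, `centroid_distance`, `average_distance`) are hypothetical and not taken from any clustering library.

```python
import math
from itertools import product

def centroid(points):
    """Per-dimension arithmetic mean of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def centroid_distance(A, B):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(A), centroid(B))

def average_distance(A, B):
    """Mean of dist(a, b) over every pair with a in A and b in B."""
    total = sum(math.dist(a, b) for a, b in product(A, B))
    return total / (len(A) * len(B))

# Two illustrative clusters (invented for this example).
A = [(0, 0), (0, 2)]
B = [(3, 0), (3, 2)]

cd = centroid_distance(A, B)
ad = average_distance(A, B)
print(cd, ad)
```

Here the centroids are (0, 1) and (3, 1), so the centroid distance is 3, while the average pairwise distance is about 3.30 because the two diagonal pairs are farther apart. The two measures generally give different numbers even though each cluster's centroid and mean are the same point.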