Solved – How to calculate the variance of vectors for clustering

clustering, variance

I'm interested in various methods of measuring dispersion of vectors mainly for use in cluster analysis. I can think of three methods:

  1. Find the mean vector (centroid), then calculate the variance of the distances from all vectors to this mean vector. A set of distinct vectors could all lie at the same distance from the mean vector, in which case this measure would be zero even though the vectors are spread out. That makes it a questionable measure, though the situation may be unlikely in practice. The Davies-Bouldin cluster quality measure appears to use a closely related quantity, the average distance to the centroid, to measure intra-cluster quality.
  2. Use the mean pairwise distance between vectors. I've seen this used to measure both intra-cluster and inter-cluster quality. This would seem to require some sort of distance matrix. The implementation may be difficult if one tries to add or remove vectors and update the distance matrix on the fly.
  3. Calculate the population variance for each component of the vectors. This would result in a vector containing the population variance for each component. Then take the sum of the components in this vector.
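To make the three candidates concrete, here is a minimal sketch in plain Python, assuming Euclidean distance and population (not sample) variance throughout. The function names are made up for illustration. The example data is the degenerate case mentioned in #1: four distinct points that are all equally far from their centroid.

```python
import math
from itertools import combinations
from statistics import pvariance


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def variance_of_centroid_distances(vectors):
    """Measure 1: population variance of the distances to the centroid."""
    c = centroid(vectors)
    return pvariance([math.dist(v, c) for v in vectors])


def mean_pairwise_distance(vectors):
    """Measure 2: mean Euclidean distance over all unordered pairs."""
    pairs = list(combinations(vectors, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def total_component_variance(vectors):
    """Measure 3: sum of the per-component population variances."""
    return sum(pvariance(col) for col in zip(*vectors))


# Corners of a square: distinct points, all at distance sqrt(2) from (0, 0).
square = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]

print(variance_of_centroid_distances(square))  # zero spread reported by #1
print(mean_pairwise_distance(square))
print(total_component_variance(square))
```

On this data measure 1 reports no dispersion at all, while measures 2 and 3 are clearly positive, which is the weakness of #1 described above.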

My questions:

  • Any thoughts on these measures?
  • Any other good measures?
  • Also, does anyone know a one-pass algorithm for calculating #1 and #2?

I know how to compute #3 with a numerically stable one-pass algorithm. Essentially, every time I add or remove a vector from a cluster, I would like the measure of quality to update automatically. I've had some luck with this for certain measures.
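For reference, one way such an updatable version of #3 could look is a Welford-style accumulator kept per component. This is only a sketch under assumptions: the class name `RunningDispersion` and its methods are hypothetical, it tracks population variance, and removal simply reverses the addition step, which is generally less numerically robust than addition because it subtracts accumulated quantities.

```python
class RunningDispersion:
    """Per-component running mean and sum of squared deviations (Welford-style)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # sum of squared deviations from the mean, per component

    def add(self, v):
        self.n += 1
        for k, x in enumerate(v):
            delta = x - self.mean[k]              # deviation from the old mean
            self.mean[k] += delta / self.n
            self.m2[k] += delta * (x - self.mean[k])

    def remove(self, v):
        """Undo a previous add(v); v is assumed to be in the cluster."""
        if self.n == 0:
            raise ValueError("cannot remove from an empty cluster")
        if self.n == 1:
            self.n = 0
            self.mean = [0.0] * len(self.mean)
            self.m2 = [0.0] * len(self.m2)
            return
        self.n -= 1
        for k, x in enumerate(v):
            delta = x - self.mean[k]              # deviation from the mean that includes x
            self.mean[k] -= delta / self.n        # mean of the remaining vectors
            self.m2[k] -= delta * (x - self.mean[k])

    def total_variance(self):
        """Measure 3: sum of per-component population variances."""
        return sum(self.m2) / self.n

    def mean_squared_pairwise_distance(self):
        """Mean squared Euclidean distance over unordered pairs (needs n >= 2)."""
        return 2.0 * sum(self.m2) / (self.n - 1)


rd = RunningDispersion(dim=2)
for p in [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]:
    rd.add(p)
rd.remove((3.0, 3.0))
print(rd.total_variance())
```

The last method relies on the identity that the sum of squared Euclidean distances over all unordered pairs equals n times the sum of squared distances to the centroid. So if squared distances are acceptable for #2, no distance matrix is needed; for plain (unsquared) distances this shortcut does not apply.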

Best Answer

Note that not all clustering algorithms assume spherical clusters. None of the measures you describe seems very sensible for non-convex clusters such as banana-shaped clusters, a common concept in density-based clustering. In such a cluster, the mean is not even inside the cluster. Variances mostly measure the spatial extent of the cluster, not its connectivity and similar properties...
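A small illustration of the banana-shaped case, assuming a synthetic cluster of 50 points placed along a half circle of radius 1 (the data and variable names are invented for this sketch): the centroid ends up far from every point of the cluster, compared with how close neighbouring points are to each other.

```python
import math

n = 50
# Points evenly spaced along the upper half of the unit circle.
arc = [
    (math.cos(math.pi * i / (n - 1)), math.sin(math.pi * i / (n - 1)))
    for i in range(n)
]

# Centroid of the cluster.
cx = sum(x for x, _ in arc) / n
cy = sum(y for _, y in arc) / n

# Distance from the centroid to the closest cluster member,
# compared with the spacing between adjacent members.
gap = min(math.dist((cx, cy), p) for p in arc)
spacing = math.dist(arc[0], arc[1])

print(f"centroid = ({cx:.3f}, {cy:.3f})")
print(f"nearest cluster point is {gap:.3f} away; neighbour spacing is {spacing:.3f}")
```

The centroid sits in the empty region enclosed by the arc, so any dispersion measure built around it describes the extent of the shape rather than how well connected the points are.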
