Measure overlap of clusters in higher dimensions

Tags: clustering, euclidean-geometry, normal-distribution

I would like to describe how well an algorithm clusters my data. For that, I would like to create a measure of how much two point clusters overlap or how far apart they are.

Take a two-dimensional example. Basically, I am most interested in the area of the intersections seen in the second graph. Unfortunately, it is not that simple to create an n-dimensional "figure" without introducing noise (or is there a way?).

Another approach would be to look at the distributions of the points, which are assumed to be normally distributed. We can fit a multivariate normal distribution to each cluster. Could I use the Bhattacharyya distance for multivariate distributions to calculate the similarity and take that as a measure? I don't know whether this is even a good approximation of the "overlap".

Of course we could also look at the points instead of the distributions. For this approach I have no idea yet.

[Figure: two-dimensional point clusters; the second graph shows their area of intersection.]

Best Answer

I would like to describe how well an algorithm clusters my data.

This is itself a common problem in clustering, so you can use metrics that have already been developed. The idea is to ensure clusters are cohesive (intra-cluster distances are small) and distinctive (inter-cluster distances are large). Personally, I've used the Davies–Bouldin index and the silhouette index before. Check out this thread and this one for more measures.
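As a sketch of how these indices can be computed in practice (assuming scikit-learn is available; the data and parameters below are illustrative, not from the question):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Two well-separated synthetic Gaussian blobs (illustrative data).
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]],
                  cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)      # in [-1, 1]; higher = better separated
dbi = davies_bouldin_score(X, labels)  # >= 0; lower = better separated
print(sil, dbi)
```

Both indices combine cohesion and separation, so they reward exactly the "small overlap" property you are after.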

Take a two-dimensional example. Basically, I am most interested in the area of the intersections seen in the second graph. Unfortunately, it is not that simple to create an n-dimensional "figure" without introducing noise (or is there a way?).

Since you have only a point set rather than some kind of $n$-dimensional polygon, it seems natural to measure the overlap in terms of the empirical probability distributions implied by the point sets, rather than the literal volumetric overlap of the point sets. (This is what your Bhattacharyya coefficient idea does.)

Nevertheless, if you want to literally measure the area of overlap geometrically, you could try the following. The idea is to measure the overlap of the convex hulls of the clusters. Since high-dimensional convex hull computation is non-trivial and also non-visualizable, first apply a dimensionality reduction method, say random projections. The overlap would then be $$ V(C_1,C_2) = \frac{1}{R}\sum_{r=1}^{R} \text{AreaOfOverlap}(\text{ConvexHull}(R_r(C_1)), \text{ConvexHull}(R_r(C_2))) $$ where $R_r$ is the $r$th random projection. Note this estimate is stochastic, but geometric rather than probabilistic. Assuming each $R_r$ maps to 2D, we can use a standard convex hull algorithm. To compute the overlap area, a simple way is to "grid" the plane and "rasterize" the two convex hulls into binary images (like pixelized projection silhouettes), then count the number of pixels (grid cells) they share. Another way is to decompose each hull into a set of triangles (i.e., obtain a triangle mesh per hull) and compute the overlapping area between each pair of triangles.
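A rough sketch of the grid/rasterization variant (the function names, grid size, and data are my own; it relies on the fact that SciPy's `Delaunay.find_simplex` returns a non-negative index exactly for points inside the convex hull):

```python
import numpy as np
from scipy.spatial import Delaunay

def hull_overlap_area(A2, B2, grid=200):
    """Approximate the overlap area of the 2D convex hulls of point
    sets A2, B2 by rasterizing a grid over their joint bounding box."""
    pts = np.vstack([A2, B2])
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    xs = np.linspace(lo[0], hi[0], grid)
    ys = np.linspace(lo[1], hi[1], grid)
    gx, gy = np.meshgrid(xs, ys)
    cells = np.column_stack([gx.ravel(), gy.ravel()])
    # A grid point lies inside a convex hull iff find_simplex >= 0.
    in_a = Delaunay(A2).find_simplex(cells) >= 0
    in_b = Delaunay(B2).find_simplex(cells) >= 0
    cell_area = (xs[1] - xs[0]) * (ys[1] - ys[0])
    return (in_a & in_b).sum() * cell_area

def random_projection_overlap(C1, C2, n_proj=20, rng=None):
    """Average convex-hull overlap area over random 2D Gaussian projections."""
    rng = np.random.default_rng(rng)
    d = C1.shape[1]
    areas = []
    for _ in range(n_proj):
        P = rng.standard_normal((d, 2)) / np.sqrt(2)  # random map R^d -> R^2
        areas.append(hull_overlap_area(C1 @ P, C2 @ P))
    return float(np.mean(areas))

rng = np.random.default_rng(0)
C1 = rng.normal(0.0, 1.0, size=(150, 5))
C2 = rng.normal(0.5, 1.0, size=(150, 5))  # slightly shifted cluster
print(random_projection_overlap(C1, C2, rng=0))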

Another approach would be to look at the distributions of the points, which are assumed to be normally distributed. We can fit a multivariate normal distribution to each cluster. Could I use the Bhattacharyya distance for multivariate distributions to calculate the similarity and take that as a measure? I don't know whether this is even a good approximation of the "overlap".

This is a totally reasonable thing to do. Once you have fitted a multivariate normal to each cluster, there are many distances/similarities you could use, some of which can be written analytically: the Bhattacharyya distance, the Fréchet distance, the KL or Jensen–Shannon divergence, the Hellinger distance, etc. A distance $d$ can always be turned into a similarity via, e.g., $T_1(d) = 1/(1+d)$ or $T_2(d)=\exp(-d)$.
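For the Bhattacharyya distance specifically, the closed form between two multivariate normals is $\frac{1}{8}\Delta\mu^\top \Sigma^{-1}\Delta\mu + \frac{1}{2}\ln\frac{\det\Sigma}{\sqrt{\det\Sigma_1\det\Sigma_2}}$ with $\Sigma = \tfrac{1}{2}(\Sigma_1+\Sigma_2)$, which is easy to implement directly; a sketch (the function names and data here are mine):

```python
import numpy as np

def fit_gaussian(C):
    """Fit mean and covariance to a cluster's points (one point per row)."""
    return C.mean(axis=0), np.cov(C, rowvar=False)

def bhattacharyya_gaussian(mu1, S1, mu2, S2):
    """Closed-form Bhattacharyya distance between two multivariate normals."""
    S = 0.5 * (S1 + S2)
    dmu = mu1 - mu2
    term1 = 0.125 * dmu @ np.linalg.solve(S, dmu)
    # slogdet is numerically safer than det for the log-determinant ratio.
    term2 = 0.5 * (np.linalg.slogdet(S)[1]
                   - 0.5 * (np.linalg.slogdet(S1)[1] + np.linalg.slogdet(S2)[1]))
    return term1 + term2

rng = np.random.default_rng(0)
C1 = rng.normal(0, 1, size=(300, 3))
C2 = rng.normal(2, 1, size=(300, 3))
d12 = bhattacharyya_gaussian(*fit_gaussian(C1), *fit_gaussian(C2))
similarity = np.exp(-d12)  # turn the distance into a similarity in (0, 1]
print(d12, similarity)
```

Identical clusters give a distance of zero (similarity one), and the distance grows as the fitted normals separate.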

The main problem here is how well each cluster is actually fit by a multivariate normal. For instance, consider a highly curved cluster (e.g., one shaped like a crescent moon). A normal distribution may struggle to represent it well, in which case the overlap will be over- or underestimated. Indeed, for clustering algorithms in general, such as $k$-means, it is always a concern whether they are biased towards clusters of a particular shape (e.g., see here).

Of course we could also look at the points instead of the distributions. For this approach I have no idea yet.

Well, firstly, any of the probabilistic measures just above, like the Bhattacharyya distance $D_B$, can be computed on the empirical distributions without fitting a normal distribution. For instance, the Bhattacharyya coefficient can be written $$ BC(p,q) = \mathbb{E}_{x\sim p}\sqrt{\frac{q(x)}{p(x)}} \approx \frac{1}{|C_p|}\sum_{x\in C_p}\sqrt{\frac{\widehat{q}(x)}{\widehat{p}(x)}}, \qquad D_B(p,q) = -\ln BC(p,q), $$ where $p$ is the probability distribution associated to cluster $C_p$ and $\widehat{p}$ is the empirical density estimated via, e.g., kernel density estimation. The benefit of this approach is that you no longer need to assume normality; the downside is that density estimation is hard, especially in high dimensions (though you can always reduce dimensionality first with, e.g., PCA).
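A minimal sketch of this estimator using SciPy's `gaussian_kde` (the function name and cluster data are my own):

```python
import numpy as np
from scipy.stats import gaussian_kde

def empirical_bhattacharyya(Cp, Cq):
    """Monte-Carlo estimate of BC = E_{x~p} sqrt(q(x)/p(x)) using kernel
    density estimates; the distance is then D_B = -log(BC)."""
    p_hat = gaussian_kde(Cp.T)  # gaussian_kde expects shape (dim, n_points)
    q_hat = gaussian_kde(Cq.T)
    ratio = q_hat(Cp.T) / p_hat(Cp.T)
    bc = np.mean(np.sqrt(ratio))
    return bc, -np.log(bc)

rng = np.random.default_rng(0)
C1 = rng.normal(0, 1, size=(400, 2))
C2 = rng.normal(3, 1, size=(400, 2))
bc_same, _ = empirical_bhattacharyya(C1, C1)       # identical clusters: BC = 1
bc_far, db_far = empirical_bhattacharyya(C1, C2)   # distant clusters: BC near 0
print(bc_same, bc_far, db_far)
```

Note the estimator can be noisy when the clusters barely overlap, since it is dominated by the few points of $C_p$ that land near $C_q$.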

Another way to compute a distance between the two distributions is the Wasserstein distance $W$ (also called the earth mover's distance). It can be computed directly on the point sets, thus avoiding density estimation. I find it very intuitive, and it behaves well (smoothly) as the clusters move apart, which some $f$-divergences do not. There are also well-studied algorithms for computing it.
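For two equal-size point sets with uniform weights, the Wasserstein distance reduces to an optimal assignment problem, which `scipy.optimize.linear_sum_assignment` solves exactly; a sketch under that equal-size assumption (function name and data are mine):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def wasserstein_point_clouds(A, B):
    """Wasserstein-1 distance between two equal-size, uniformly weighted
    point clouds: with these weights it is an optimal matching problem."""
    assert len(A) == len(B), "equal-size clusters assumed for simplicity"
    cost = cdist(A, B)                        # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return cost[rows, cols].mean()

rng = np.random.default_rng(0)
C1 = rng.normal(0, 1, size=(100, 4))
near = wasserstein_point_clouds(C1, C1 + 0.5)  # small translation
far = wasserstein_point_clouds(C1, C1 + 5.0)   # large translation
print(near, far)
```

Translating a cluster by a vector $v$ yields exactly $W = \|v\|$, which illustrates the smooth growth with separation mentioned above; for unequal cluster sizes you would need a general optimal transport solver instead.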
