Solved – KL divergence or similar “distance” metric between two multivariate distributions

Tags: bivariate, clustering, distance, distributions

I have a large dataset composed of many samples; each sample is as follows:

  • imagine a grid indexed by i,j
  • for a sample k, I have Y_k, where Y_k(i,j) is the probability density for k at (i,j)
  • of course, summing Y_k(i,j) over every cell of the grid yields 1
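
For concreteness, here is a minimal sketch of this setup in Python (the array name `Y`, the grid size, and the random stand-in data are illustrative assumptions, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the real data: ~100 samples over one shared, fixed grid.
# Each Y[k] is a 2-D array of non-negative values normalized to sum to 1,
# so Y[k][i, j] plays the role of Y_k(i, j) above.
n_samples, n_rows, n_cols = 100, 20, 20
Y = rng.random((n_samples, n_rows, n_cols))
Y /= Y.sum(axis=(1, 2), keepdims=True)

assert np.allclose(Y.sum(axis=(1, 2)), 1.0)  # each grid sums to 1
```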

I have many (~100 for now) such samples, with different probability distributions over the exact same fixed grid. The distributions don't follow any parametric model, as far as I can tell.

My question is: I would like to compute all-vs-all "distances" between these distributions, then feed those distances into a clustering algorithm to identify the general "classes" of distributions. Although they are all different from each other, just looking at them visually I can see clear groupings; it would be nice to do this algorithmically so the approach scales.
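
One way to do this, sketched below assuming the grids are stored in the array `Y` from the previous snippet: compute the pairwise Jensen–Shannon distance (a symmetric, bounded relative of KL that stays finite even where one distribution has zero density and the other does not) with SciPy, then feed the resulting matrix into hierarchical clustering. The linkage method and the cut into four clusters are arbitrary choices for illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Flatten each grid to a 1-D probability vector.
P = Y.reshape(Y.shape[0], -1)

# All-vs-all symmetric distance matrix.
n = P.shape[0]
D = np.zeros((n, n))
for a in range(n):
    for b in range(a + 1, n):
        D[a, b] = D[b, a] = jensenshannon(P[a], P[b])

# Condense the square matrix and cluster; cutting the tree into
# 4 groups is an arbitrary choice here.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=4, criterion="maxclust")
```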

Best Answer

This has by no means been substantiated as a proper measure of divergence, but I've had some luck with the Hausdorff distance between two samples (i.e., two finite sets of points).

The best way to understand it intuitively is as a game: one player must travel from a point in one set to the nearest point in the other set, covering as small a distance as possible, while a malevolent second player picks the starting point to maximize that distance. The resulting worst-case distance is the (directed) Hausdorff distance; taking the maximum over both directions gives the symmetric version.
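
Here is a minimal sketch of that idea in Python, assuming the gridded distributions are first reduced to point sets (how to do that is not specified in the answer; thresholding on density, as in the hypothetical `support_points` helper below, is one arbitrary choice). SciPy's `directed_hausdorff` implements the one-sided "game" described above; taking the maximum over both directions gives the symmetric distance:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(u, v):
    # Symmetric Hausdorff distance between two (n_points, 2) arrays:
    # the max over both one-sided (directed) distances.
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])

def support_points(grid, q=0.95):
    # Reduce a gridded distribution to a point set by keeping the (i, j)
    # coordinates of its highest-density cells; the quantile cutoff is
    # an arbitrary choice for this sketch.
    return np.argwhere(grid >= np.quantile(grid, q)).astype(float)

# Example: distance between the first two samples (Y as defined earlier).
d = hausdorff(support_points(Y[0]), support_points(Y[1]))
```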

I don't know of any theoretical work showing its behavior as a distance measure on the underlying distributions, but I've used it as a target function in optimization algorithms with some success.