Solved – KL divergence or similar “distance” metric between two multivariate distributions

Tags: bivariate, clustering, distance, distributions

I have a large dataset composed of many samples; each sample is as follows:

  • imagine a grid indexed by i,j
  • for a sample k, I have Y_k, where Y_k(i,j) is the probability density for k at (i,j)
  • of course, summing Y_k(i,j) over every cell of the grid yields 1
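
For concreteness, here is a minimal sketch of this setup in Python (the array name `Y`, the grid size, and the random stand-in data are illustrative assumptions, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the real data: ~100 samples over one shared, fixed grid.
# Each Y[k] is a 2-D array of non-negative values normalized to sum to 1,
# so Y[k][i, j] plays the role of Y_k(i, j) above.
n_samples, n_rows, n_cols = 100, 20, 20
Y = rng.random((n_samples, n_rows, n_cols))
Y /= Y.sum(axis=(1, 2), keepdims=True)

assert np.allclose(Y.sum(axis=(1, 2)), 1.0)  # each grid sums to 1
```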

I have many (~100 for now) such samples, with different probability distributions over the exact same fixed grid. The distributions don't follow any parametric model, as far as I can tell.

My question is: I would like to compute all-vs-all "distances" between these distributions, then feed those distances into a clustering algorithm to identify the general "classes" of distributions. Although they are all different from each other, just looking at them visually I can see clear groupings; it would be nice to do this algorithmically so the approach scales.
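
One way to do this, sketched below assuming the grids are stored in the array `Y` from the previous snippet: compute the pairwise Jensen–Shannon distance (a symmetric, bounded relative of KL that stays finite even where one distribution has zero density and the other does not) with SciPy, then feed the resulting matrix into hierarchical clustering. The linkage method and the cut into four clusters are arbitrary choices for illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Flatten each grid to a 1-D probability vector.
P = Y.reshape(Y.shape[0], -1)

# All-vs-all symmetric distance matrix.
n = P.shape[0]
D = np.zeros((n, n))
for a in range(n):
    for b in range(a + 1, n):
        D[a, b] = D[b, a] = jensenshannon(P[a], P[b])

# Condense the square matrix and cluster; cutting the tree into
# 4 groups is an arbitrary choice here.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=4, criterion="maxclust")
```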

Best Answer

This has by no means been substantiated as a proper measure of divergence, but I've had some luck with the Hausdorff distance between two samples (i.e., two finite sets of points).

The best way to understand it intuitively is as a game: one player must travel from a point in one set to the nearest point in the other set, covering as small a distance as possible, while a malevolent second player picks the starting point to maximize that distance. The resulting worst-case distance is the (directed) Hausdorff distance; taking the maximum over both directions gives the symmetric version.
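
Here is a minimal sketch of that idea in Python, assuming the gridded distributions are first reduced to point sets (how to do that is not specified in the answer; thresholding on density, as in the hypothetical `support_points` helper below, is one arbitrary choice). SciPy's `directed_hausdorff` implements the one-sided "game" described above; taking the maximum over both directions gives the symmetric distance:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(u, v):
    # Symmetric Hausdorff distance between two (n_points, 2) arrays:
    # the max over both one-sided (directed) distances.
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])

def support_points(grid, q=0.95):
    # Reduce a gridded distribution to a point set by keeping the (i, j)
    # coordinates of its highest-density cells; the quantile cutoff is
    # an arbitrary choice for this sketch.
    return np.argwhere(grid >= np.quantile(grid, q)).astype(float)

# Example: distance between the first two samples (Y as defined earlier).
d = hausdorff(support_points(Y[0]), support_points(Y[1]))
```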

I don't know of any theoretical work showing its behavior as a distance measure on the underlying distributions, but I've used it as a target function in optimization algorithms with some success.