Jensen-Shannon divergence for finite samples

binning, distance, entropy, finite-population, information theory

I have two finite samples $s_1$ and $s_2$ and two distributions $p_1(s_1)$ and $p_2(s_2)$ associated with these samples. I am essentially interested in measuring the distance or similarity between these two distributions. I am currently using the Jensen-Shannon (JS) distance, which involves calculating the entropies of the two distributions separately and jointly.
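(For concreteness, I am using the standard entropy form of the JS divergence,

$$\mathrm{JS}(p_1, p_2) \;=\; H\!\left(\frac{p_1 + p_2}{2}\right) \;-\; \frac{H(p_1) + H(p_2)}{2},$$

where $H$ denotes Shannon entropy, and taking the square root of this quantity as the JS distance.)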
I need to bin the data to compute these entropies, and it turns out that the resulting JS distance is also a function of the binning strategy I use. This seems quite arbitrary, and I don't know how much I can trust the results.
Is there any way around this? How can I measure the distance between $p_1(s_1)$ and $p_2(s_2)$ more rigorously, without data binning or any other such arbitrary choice?

Best Answer

If your data come from a setting where it is realistic to treat the measurements as "noisy", it may help to pick a prior noise model for your data, so that both of your distributions have support over the same space.

A simple approach would be to consider that your distributions are Gaussian mixtures with means at the observed data points and a single common variance. This "smoothing" gives each distribution support on the whole real line, for 1-dimensional data. It then remains to calculate the Kullback-Leibler divergences between the two Gaussian mixtures: there is no closed-form solution for this, but you can compute it numerically using the approaches in Hershey and Olsen (1), e.g. by Monte Carlo sampling. For higher-dimensional data you could use multivariate Gaussian mixtures (with full-rank covariance matrices, for the reason given by Matus Telgarsky in the MathOverflow discussion (2)).
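Here is a minimal sketch of that idea in Python for 1-dimensional data; the function names and the particular Monte Carlo scheme are my own, not taken from the paper. It estimates $\mathrm{JS}(p_1, p_2) = \tfrac{1}{2}\mathrm{KL}(p_1\|m) + \tfrac{1}{2}\mathrm{KL}(p_2\|m)$, with $m = (p_1 + p_2)/2$, by drawing samples from each smoothed (equal-weight Gaussian mixture) distribution:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm


def mixture_logpdf(x, centres, sigma):
    """Log-density of an equal-weight Gaussian mixture with components
    centred at `centres` and a common standard deviation `sigma`."""
    # Broadcast to shape (len(x), len(centres)), then sum components in log space.
    log_components = norm.logpdf(x[:, None], loc=centres[None, :], scale=sigma)
    return logsumexp(log_components, axis=1) - np.log(len(centres))


def js_divergence_mc(s1, s2, sigma, n_mc=100_000, rng=None):
    """Monte Carlo estimate of the Jensen-Shannon divergence between the
    smoothed distributions p1 (centred on s1) and p2 (centred on s2)."""
    rng = np.random.default_rng(rng)
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)

    def sample(centres, n):
        # Draw from the mixture: pick a component uniformly, then add Gaussian noise.
        idx = rng.integers(len(centres), size=n)
        return centres[idx] + sigma * rng.standard_normal(n)

    def kl_to_mixture(x, centres):
        # Estimate KL(p || m), with m = (p1 + p2) / 2, from samples x ~ p.
        log_p = mixture_logpdf(x, centres, sigma)
        log_m = np.logaddexp(mixture_logpdf(x, s1, sigma),
                             mixture_logpdf(x, s2, sigma)) - np.log(2)
        return np.mean(log_p - log_m)

    # JS(p1, p2) = 0.5 * KL(p1 || m) + 0.5 * KL(p2 || m)
    return 0.5 * kl_to_mixture(sample(s1, n_mc), s1) + \
           0.5 * kl_to_mixture(sample(s2, n_mc), s2)
```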

Of course, this has not removed all arbitrariness: we still have the common prior variance (and the choice of noise model). But it should be easy to study numerically how the Jensen-Shannon divergence behaves as you vary the single parameter of the noise model.
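For example, a quick sensitivity check using the sketch above (synthetic data and illustrative values of $\sigma$ only):

```python
rng = np.random.default_rng(0)
s1 = rng.normal(0.0, 1.0, size=200)   # toy sample 1
s2 = rng.normal(0.5, 1.2, size=150)   # toy sample 2

# See how the estimated JS divergence varies with the smoothing parameter.
for sigma in (0.05, 0.1, 0.2, 0.5, 1.0):
    print(f"sigma={sigma:.2f}  JS ~ {js_divergence_mc(s1, s2, sigma, rng=1):.4f}")
```

If the estimate is stable over a reasonable range of $\sigma$, that is some reassurance that your conclusions do not hinge on the smoothing choice.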
