Solved – Distance measure between probability density functions

Tags: distance-functions, distributions, machine learning, probability

I have some generated data that I want to follow a given Gaussian distribution $N(\mu,\sigma)$, and I would like to quantify the distance between the generated distribution and $N(\mu,\sigma)$.

Specifically, I am generating a set of atomic positions and I want the distribution of the distances between each atom and its $k$ nearest neighbors ($k$ is fixed) to be as close as possible to $N(\mu,\sigma)$. For that purpose, I sample $m$ points from $N(\mu,\sigma)$, where $m$ is the number of generated atoms, and I would like to evaluate the distance between the two distributions (generated and sampled from $N$).

The generating process is a deep neural network, which implies that I need this distance to be as smooth (continuous and differentiable) as possible so that the DNN gets the most meaningful gradient information possible.

In this case, it is more important to match the mean and the standard deviation $(\mu,\sigma)$ than to match the actual Gaussian shape. I am a bit lost in the wide range of measures that are available.

My question

What would be a suitable statistical distance in this case?

Should I always compare the cumulative/empirical distribution functions?
Are there other ways to compare PDFs?

Best Answer

There are quite a few such measures between probability distributions, each with its own properties and goals. For example, you could use the earth mover's distance (also known as the Wasserstein distance). If these measures don't emphasize the mean and SD enough for your tastes, you can add the absolute distances between the observed and desired mean and the observed and desired SD as penalty terms when evaluating your neural network.
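As a rough illustration of the penalty idea, here is a minimal sketch in plain Python. It assumes one-dimensional data and two samples of equal size, in which case the earth mover's distance reduces to the mean absolute difference between sorted values. The function names `emd_equal_size` and `penalized_loss` and the weights `w_mean` and `w_sd` are made up for this example and would need tuning; they are not part of any library.

```python
import statistics


def emd_equal_size(xs, ys):
    """1-D earth mover's (Wasserstein-1) distance between two equal-size samples."""
    if len(xs) != len(ys):
        raise ValueError("this shortcut needs samples of equal size")
    # In 1-D with equal sample sizes, the optimal transport plan pairs sorted values.
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def penalized_loss(generated, reference, mu, sigma, w_mean=1.0, w_sd=1.0):
    """EMD plus absolute-error penalties on the mean and standard deviation.

    w_mean and w_sd are hypothetical weights chosen for illustration.
    """
    mean_gap = abs(statistics.fmean(generated) - mu)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    sd_gap = abs(statistics.stdev(generated) - sigma)
    return emd_equal_size(generated, reference) + w_mean * mean_gap + w_sd * sd_gap
```

In an actual training loop the same sort-and-compare computation would be written with the tensor operations of your deep learning framework so that gradients can flow through it; the version above only shows the arithmetic.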

Note that in any case, it makes sense to compare the empirical distribution of your simulated atomic distances to the theoretical normal distribution, rather than using the empirical distribution of a simulated normal distribution. Comparing against a simulated sample adds sampling noise that the theoretical distribution does not have.
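One way to compare against the theoretical normal directly is to replace the simulated reference sample with the normal quantiles. The sketch below is an approximation under assumptions of my own: it evaluates the quantile function at the midpoint probabilities $(i + 0.5)/m$ and averages the absolute differences to the sorted data. The name `emd_to_normal` is hypothetical; `statistics.NormalDist` is from the Python standard library and takes $\sigma$ as the standard deviation.

```python
from statistics import NormalDist


def emd_to_normal(samples, mu, sigma):
    """Approximate Wasserstein-1 distance from a 1-D sample to N(mu, sigma)."""
    m = len(samples)
    target = NormalDist(mu, sigma)
    # Theoretical quantiles at the midpoint probabilities (i + 0.5) / m.
    quantiles = [target.inv_cdf((i + 0.5) / m) for i in range(m)]
    # Pair the sorted sample with the (already increasing) quantiles.
    return sum(abs(x - q) for x, q in zip(sorted(samples), quantiles)) / m
```

Because the quantiles are fixed numbers, the result depends only on the generated values, so it is deterministic for a given sample. Squaring the differences instead of taking absolute values gives a smoother variant if the kink at zero is a concern for the gradients.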