Distance Metric – Comparing Two Sample Distributions (Histograms)

Tags: probability-distributions, st.statistics, statistical-physics

Context: I want to compare the sample probability distributions (PDFs) of two datasets (generated from a dynamical system). These datasets depend on a set of parameters, and I want a concise way to evaluate the distance between the two PDFs across several different parameter regimes, ideally as a single number. For a fixed parameter regime, my two sample PDFs are given by the vectors $x$ and $y$, where $x_i$ is the relative frequency of samples which lie in the $i$th bin.

One method I've seen is the Kolmogorov-Smirnov (K-S) statistic, which is the maximum vertical distance between the empirical cumulative distribution functions of the two datasets. This would work for my purposes, but I'm starting to think that the chi-squared distance may be better suited (at the very least, I had heard of it before). It is given by $d(x,y) = \frac{1}{2}\sum_i \frac{(x_i-y_i)^2}{x_i+y_i}$. But this doesn't seem to make sense when $x_i = y_i = 0$, which seems like a fairly common occurrence.
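
One common convention, for what it's worth, is simply to drop bins where both frequencies are zero: such terms contribute nothing in the limit, and skipping them avoids the division by zero. A minimal sketch in Python/NumPy, assuming `x` and `y` are the binned relative-frequency vectors described above (the example numbers are illustrative):

```python
import numpy as np

def chi_squared_distance(x, y):
    """d(x, y) = 1/2 * sum_i (x_i - y_i)^2 / (x_i + y_i),
    skipping bins where x_i = y_i = 0 (they contribute 0 in the limit)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = (x + y) > 0                      # ignore bins empty in both histograms
    diff = x[mask] - y[mask]
    return 0.5 * np.sum(diff**2 / (x[mask] + y[mask]))

# Two 4-bin histograms with a shared empty bin:
x = np.array([0.5, 0.3, 0.2, 0.0])
y = np.array([0.4, 0.4, 0.2, 0.0])
print(chi_squared_distance(x, y))           # ~0.0127
```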

Any recommendations?

Edit: My first instinct was simply to take the $L^2$ ($\ell^2$) norm of the difference of the two sample PDFs. But then I thought the $\ell^1$ norm might be more suitable for a probability distribution. After looking a little further, I stumbled upon the K-S and chi-squared distances.
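
For reference, both norms are one-liners on the binned vectors; a quick sketch using the same illustrative `x` and `y` as above (note that the $\ell^1$ distance between two probability vectors is twice their total variation distance):

```python
import numpy as np

x = np.array([0.5, 0.3, 0.2, 0.0])
y = np.array([0.4, 0.4, 0.2, 0.0])

l1 = np.sum(np.abs(x - y))           # = 2 * total variation distance
l2 = np.sqrt(np.sum((x - y) ** 2))   # Euclidean distance between the frequency vectors

print(l1, l2)                        # 0.2, ~0.141
```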

Best Answer

In the first place, the answer depends on the nature of your data (e.g., continuous numerical, discrete numerical, nominal, etc.); in each case the empirical measures on the range of your data have to be compared with the corresponding specific methods. I presume that your values are real numbers (since you mention the K-S distance). In this case the most natural topology on the space of measures for your problem is the weak topology, and by no means the norm (total variation) topology, as the empirical measures you deal with will typically be pairwise mutually singular.

The weak topology can indeed be metrized by the ad hoc Lévy-Prokhorov metric; however, the transportation metric (which also metrizes the weak topology) is far more canonical and appropriate in this situation (it was evoked in R Hahn's answer under the names Earth Mover's distance and Wasserstein metric). For instance, the K-S distance between two distinct $\delta$-measures is always 1 and their total variation distance is 2, whereas the transportation distance between them is equal to the distance between the corresponding points, so it correctly reflects their similarity. Another advantage is that, unlike the Lévy-Prokhorov metric, the transportation metric on the line can easily be computed explicitly (by using its dual description in terms of Lipschitz functions, or equivalently as the $L^1$ distance between the two cumulative distribution functions).
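
A hedged sketch of how the transportation (Wasserstein-1) distance is computed in practice on the line, using SciPy's `scipy.stats.wasserstein_distance`; the sample arrays and bin edges below are illustrative, not taken from the question:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# The delta-measure example from above: the distance equals the distance between the points.
print(wasserstein_distance([0.0], [2.5]))        # 2.5

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=10_000)            # samples from the first system
b = rng.normal(0.5, 1.2, size=10_000)            # samples from the second system

# Directly from the raw samples, no binning required:
print(wasserstein_distance(a, b))

# From binned data, if only the histograms are available
# (bin centers as support points, counts as weights):
bins = np.linspace(-6, 6, 61)
centers = 0.5 * (bins[:-1] + bins[1:])
x, _ = np.histogram(a, bins=bins)
y, _ = np.histogram(b, bins=bins)
print(wasserstein_distance(centers, centers, u_weights=x, v_weights=y))
```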

The metrics used in statistics generally serve a purpose different from yours, and they do not make much sense in your situation of comparing two mutually singular empirical distributions. Most statistical distances only make sense for equivalent measures (like the Kullback-Leibler deviation). Of course, you can always discretize your data using bins, but you then lose information by making the data "coarser" (which is fine for typical statistical purposes, but not necessarily for you). I see no reason to do that when one can work efficiently with the original data by metrizing the weak topology.
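
To make the point about equivalent measures concrete, here is a small sketch showing that the Kullback-Leibler deviation blows up as soon as one histogram has mass in a bin where the other has none (the vectors below are purely illustrative):

```python
import numpy as np

def kl_divergence(x, y):
    """KL(x || y) over binned relative frequencies; 0 * log(0 / y_i) is taken as 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = x > 0
    with np.errstate(divide="ignore"):
        return np.sum(x[mask] * np.log(x[mask] / y[mask]))

print(kl_divergence([0.5, 0.5, 0.0], [0.4, 0.6, 0.0]))  # finite (~0.02)
print(kl_divergence([0.5, 0.3, 0.2], [0.5, 0.5, 0.0]))  # inf: x has mass where y has none
```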
