Solved – Similarity measure between multiple distributions

Tags: distance, distributions, similarities

To compare distributions, it is common to use box plots.

I'm looking for a similarity measure that quantifies whether distributions are
similar or not.

Ideally, if e.g. four distributions are exactly the same,
it would return 1.
The more they differ, the closer it would get to 0.

To compute that, is it mandatory to have the same sample size?

Thanks for your help.

Best Answer

It's more common to measure discrepancy than similarity, but many discrepancy measures can easily be converted into the kind of similarity you describe.

Possible measures of discrepancy in distribution include (but are not limited to):

Kolmogorov-Smirnov distance. This distance between cdfs (or empirical cdfs), $D$, is small when the distributions are the same and close to 1 when they're very different, so $1-D$ should have the property you seek, and it doesn't require the same number of observations (indeed many of these measures don't).
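For instance, a minimal sketch using SciPy's two-sample Kolmogorov-Smirnov test (the samples and their sizes below are made up for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=200)
y = rng.normal(loc=0.0, scale=1.0, size=150)  # different sample sizes are fine

res = ks_2samp(x, y)          # res.statistic is D = sup |F_x - F_y|, in [0, 1]
similarity = 1.0 - res.statistic  # 1 when the empirical cdfs coincide
print(similarity)
```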

Bhattacharyya distance. The Bhattacharyya coefficient, to which it is related (see the article), is a measure of similarity of distributions of the form you suggest.
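A minimal sketch of estimating the Bhattacharyya coefficient from binned samples (the bin count, the samples, and the helper name are illustrative choices of mine):

```python
import numpy as np

def bhattacharyya_coefficient(x, y, bins=20):
    """BC = sum_i sqrt(p_i * q_i): 1 for identical histograms, -> 0 as overlap vanishes."""
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    # Bin both samples over a common range and convert counts to proportions
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(np.sqrt(p * q))

rng = np.random.default_rng(1)
bc = bhattacharyya_coefficient(rng.normal(size=500), rng.normal(size=300))
print(bc)           # similarity in [0, 1]
print(-np.log(bc))  # the corresponding Bhattacharyya distance
```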

Information divergence (Kullback-Leibler divergence). This is not symmetric ($D(x\,\|\,y) \neq D(y\,\|\,x)$) and so is not a metric, but it can be made symmetric (e.g. by looking at $D(x\,\|\,y)+D(y\,\|\,x)$), and there are some metric distances related to this divergence.
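A minimal sketch of the symmetrized version using SciPy's `entropy`, which computes the KL divergence when given two probability vectors (the vectors below are made up for illustration):

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])

# Symmetrized KL: D(p||q) + D(q||p); 0 iff p == q, grows as they diverge.
# Note KL requires q_i > 0 wherever p_i > 0 (and vice versa here).
sym_kl = entropy(p, q) + entropy(q, p)
print(sym_kl)
```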

Chi-square distance: A variety of related measures go by this name; they are used for discrete data (or discretized continuous data). I'll mention one: $d(x,y) = \frac{1}{2}\sum_i \frac{(x_i-y_i)^2}{x_i+y_i}$. As with the other chi-square distances, this requires discretization into the same set of categories for both variables, and the $x_i$ and $y_i$ are each category's proportion of its sample's total count. This distance lies between 0 and 1, and is converted to a similarity by subtracting it from 1.
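A minimal sketch of this distance turned into a similarity (the category counts and the helper name are illustrative):

```python
import numpy as np

def chi_square_similarity(x_counts, y_counts):
    """1 - (1/2) * sum (x_i - y_i)^2 / (x_i + y_i), with x, y as proportions."""
    x = np.asarray(x_counts, dtype=float)
    y = np.asarray(y_counts, dtype=float)
    x = x / x.sum()  # proportions of each sample's total count
    y = y / y.sum()
    mask = (x + y) > 0  # skip categories empty in both samples, avoiding 0/0
    d = 0.5 * np.sum((x[mask] - y[mask]) ** 2 / (x[mask] + y[mask]))
    return 1.0 - d      # 1 when identical, 0 when the supports are disjoint

print(chi_square_similarity([10, 30, 60], [12, 28, 60]))
```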
