Multivariate Distributions – Measuring the Distance Between Two Multivariate Distributions

Tags: distance-functions, multivariate-analysis, terminology

I'm looking for some good terminology to describe what I'm trying to do, to make it easier to look for resources.

So, say I have two clusters of points, A and B, where each point has two values, X and Y, and I want to measure the "distance" between A and B, i.e. how likely it is that they were sampled from the same distribution (I can assume that the distributions are normal). For example, if X and Y are correlated in A but not in B, the distributions are different.

Intuitively, I would get the covariance matrix of A, and then look at how likely each point in B is to fit in there, and vice versa (probably using something like Mahalanobis distance).
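To make that idea concrete, here is a minimal sketch of what I have in mind, assuming NumPy and that each cluster is an array with one row per point. The function names `mahalanobis_to_cluster` and `symmetric_mahalanobis_score` are made up for illustration, and averaging the two directions is just one arbitrary way to combine them.

```python
import numpy as np

def mahalanobis_to_cluster(points, reference):
    """Mahalanobis distance from each row of `points` to the mean of
    `reference`, using the sample covariance of `reference`."""
    points = np.asarray(points, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mean = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)  # rows are observations
    diff = points - mean
    # Solve cov @ x = diff.T instead of inverting cov explicitly.
    solved = np.linalg.solve(cov, diff.T).T
    # Row-wise quadratic form diff_i^T cov^{-1} diff_i, then square root.
    return np.sqrt(np.einsum("ij,ij->i", diff, solved))

def symmetric_mahalanobis_score(a, b):
    """Hypothetical summary score: mean distance of B's points to A,
    averaged with the mean distance of A's points to B."""
    b_in_a = mahalanobis_to_cluster(b, a).mean()
    a_in_b = mahalanobis_to_cluster(a, b).mean()
    return 0.5 * (b_in_a + a_in_b)
```

This needs each cluster to have enough points for its covariance matrix to be non-singular, otherwise the solve step fails.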

But that is a bit "ad-hoc", and there is probably a more rigorous way of describing this (of course, in practice I have more than two datasets with more than two variables – I'm trying to identify which of my datasets are outliers).

Thanks!

Best Answer

There is also the Kullback-Leibler divergence, which is related to the Hellinger Distance you mention above.
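Since you can assume normality, the KL divergence between the two fitted Gaussians has a closed form: for N0 = N(μ0, Σ0) and N1 = N(μ1, Σ1) in k dimensions, KL(N0 || N1) = ½ [ tr(Σ1⁻¹ Σ0) + (μ1 − μ0)ᵀ Σ1⁻¹ (μ1 − μ0) − k + ln(det Σ1 / det Σ0) ]. Below is a rough sketch of that formula, assuming NumPy and one row per point; the names `gaussian_kl` and `kl_between_samples` are just for illustration, and plugging in sample means and covariances is my assumption about how you would apply it to your clusters.

```python
import numpy as np

def gaussian_kl(mean0, cov0, mean1, cov1):
    """Closed-form KL(N0 || N1) for two multivariate normal distributions."""
    mean0 = np.asarray(mean0, dtype=float)
    mean1 = np.asarray(mean1, dtype=float)
    cov0 = np.asarray(cov0, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    k = mean0.shape[0]
    diff = mean1 - mean0
    # tr(cov1^{-1} cov0), computed with a linear solve instead of an inverse.
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    # (mean1 - mean0)^T cov1^{-1} (mean1 - mean0)
    quad_term = diff @ np.linalg.solve(cov1, diff)
    # Log-determinants via slogdet for numerical stability.
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term + quad_term - k + logdet1 - logdet0)

def kl_between_samples(a, b):
    """Fit a Gaussian to each sample (rows are points) and return KL(A || B)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return gaussian_kl(a.mean(axis=0), np.cov(a, rowvar=False),
                       b.mean(axis=0), np.cov(b, rowvar=False))
```

Keep in mind that KL divergence is not symmetric, so KL(A || B) and KL(B || A) generally differ and it is not a distance in the metric sense. If you need a symmetric score for ranking datasets, adding the two directions together is one option.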