Multivariate Analysis – Calculating Mahalanobis Distance to a Multivariate Distribution with Few Sample Points

classification, distance-functions, multivariate-analysis

Say I have a set of sample points generated by a multivariate normal distribution D whose parameters I don't know.

I want to be able to measure the distance from an arbitrary point to the distribution D.

One way of doing this would be to estimate the parameters of D from the sample and use them to compute the Mahalanobis distance from the point to the center of D.
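For concreteness, this is roughly what I'm doing now – a minimal sketch, assuming my samples from D are in an (n, p) NumPy array `X` and the query point is a length-p vector `x`:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

def mahalanobis_to_sample(x, X):
    """Plug-in Mahalanobis distance from x to the distribution estimated from X."""
    mu = X.mean(axis=0)              # estimated mean of D
    cov = np.cov(X, rowvar=False)    # estimated covariance of D
    # Note: with few samples relative to p, cov is noisy or even singular,
    # which is exactly the problem described below.
    return mahalanobis(x, mu, np.linalg.inv(cov))
```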

However, this estimate becomes unreliable as the sample size gets small, or as the number of dimensions goes up.

Additional information I can use is a collection of other sets of sample points belonging to similar distributions – I could use these to form a "prior" over the parameters of D, and then update it using the sample points I know D generated.

My intuition is that I could compute a Mahalanobis distance DC using the covariance matrix of my sample points, and a Mahalanobis distance DP using the covariance matrix of all the data points I have (not only those belonging to my distribution), and use a weighted sum of them as my distance metric, with weights that depend on the size of my sample from D (if the sample is small, I'd better rely on DP; if it's large, I can use DC). But I'm not sure how to formalize that exactly, and I feel I'm missing some conceptual tools.
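One way I could imagine formalizing this – a rough sketch, not something I'm sure is right – is to blend the two covariance estimates (a shrinkage-style estimator) rather than the two distances themselves, with a hypothetical "prior sample size" `n0` controlling the weight. Here `X` holds the samples known to come from D and `X_all` holds the pooled samples from all similar distributions:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

def shrunk_mahalanobis(x, X, X_all, n0=30.0):
    mu = X.mean(axis=0)
    cov_sample = np.cov(X, rowvar=False)        # covariance from D's own samples (the DC idea)
    cov_pooled = np.cov(X_all, rowvar=False)    # covariance from all pooled data (the DP idea)
    lam = len(X) / (len(X) + n0)                # weight shifts toward the sample covariance as n grows
    cov = lam * cov_sample + (1.0 - lam) * cov_pooled
    return mahalanobis(x, mu, np.linalg.inv(cov))
```

With few samples the blended matrix is dominated by the pooled covariance, which matches the intuition above, but the choice of `n0` is ad hoc, which is part of what I'd like to put on firmer footing.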

(This is similar to a standard classification problem, but here I'm not interested in class membership, only in the distance to a given class – the other classes are useful only for providing priors.)

So, how would you formally describe this problem? Is there a standard solution?

Best Answer

If you have very little data, it is not so much that the distance estimate is wrong as that it is uncertain. A Bayesian approach would seek the posterior distribution of the distance between the arbitrary point and the multivariate distribution, rather than a single point estimate, and then marginalise over that posterior in reaching your conclusion. This posterior distribution reflects the uncertainty in estimating the mean and covariance matrix of the multivariate Gaussian distribution.
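As a rough illustration of what marginalising over the posterior could look like, here is a Monte Carlo sketch assuming a conjugate Normal-Inverse-Wishart prior on the mean and covariance; the hyperparameters `mu0`, `kappa0`, `nu0`, `Psi0` are placeholders, and per the caveat below they should be kept weak unless you can justify stronger values:

```python
import numpy as np
from scipy.stats import invwishart
from scipy.spatial.distance import mahalanobis

def posterior_distance_samples(x, X, mu0, kappa0, nu0, Psi0, n_draws=1000):
    """Draws from the posterior distribution of the Mahalanobis distance
    between the query point x and the unknown Gaussian that generated X."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)

    # Conjugate Normal-Inverse-Wishart posterior updates
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    diff = (xbar - mu0).reshape(-1, 1)
    Psi_n = Psi0 + S + (kappa0 * n / kappa_n) * (diff @ diff.T)

    dists = np.empty(n_draws)
    for i in range(n_draws):
        Sigma = invwishart.rvs(df=nu_n, scale=Psi_n)                 # draw a covariance matrix
        mu = np.random.multivariate_normal(mu_n, Sigma / kappa_n)    # draw a mean given that covariance
        dists[i] = mahalanobis(x, mu, np.linalg.inv(Sigma))
    return dists
```

Summarising the returned draws (for example by their posterior mean, or by a quantile) gives a distance estimate whose spread automatically widens when the sample from D is small.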

I would be wary of using an informative prior. In a Bayesian analysis, the conclusions are only as strong as the prior assumptions on which they are based; if the prior is questionable, then so is the posterior. Without more information about the problem it is not possible to determine whether such a prior is reasonable.