Nonparametric Methods – Finding Robust, Distribution-Free Distance Between Multivariate Samples

Tags: distance-functions, distributions, nonparametric, robust

There are many distance functions for distributions out there, but I'm having a hard time wading through them all to find one that

  1. is "distribution-free", or "nonparametric", by which I mean only that it makes few / weak assumptions about the underlying distributions (in particular, does not assume normality);
  2. is robust to outliers.

(Of these two desired properties, (1) is considerably more important than (2).)

I realize that the features above would likely reduce the discriminating power of any measure, but they reflect the reality of the data I'm working with.[1]

If it helps to clarify the problem, I could post a small subsample of the data, with the features suitably cloaked (this is unpublished data owned by my collaborators). The one concern I have is that any subsample that is small enough to be "postable" as part of a CrossValidated post would be too small to adequately represent the entire dataset. I'd appreciate some guidance on the matter.


Background (aka tl;dr)

I originally set out to use the Bhattacharyya distance $D_B(\mathbf{x}, \mathbf{y})$ to measure distances between the sample distributions of various pairs of subsamples $(\mathbf{x}, \mathbf{y})$ in my dataset, but I quickly ran into the problem that the matrix $(\mathrm{cov}(\mathbf{x}) + \mathrm{cov}(\mathbf{y}))/2$, whose inverse is required to compute $D_B(\mathbf{x}, \mathbf{y})$,[2] is ill-conditioned for many of these pairs $(\mathbf{x}, \mathbf{y})$.

This led me to read more about the theory behind $D_B$, from which I gathered that the formula I had been using to compute it assumes that the underlying distributions are all normal. I figured that there may be some connection (however weak) between the numerical problems I'd run into and the fact that the distributions I am working with do not come even close to meeting this normality condition.
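For concreteness, the Gaussian-case expression in question (the standard closed form for two multivariate normals with means $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$ and covariances $\boldsymbol{\Sigma}_1, \boldsymbol{\Sigma}_2$) is

$$D_B = \frac{1}{8}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) + \frac{1}{2}\ln\frac{\det \boldsymbol{\Sigma}}{\sqrt{\det \boldsymbol{\Sigma}_1 \det \boldsymbol{\Sigma}_2}}, \qquad \boldsymbol{\Sigma} = \frac{\boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_2}{2},$$

so both the inverse and the determinant of the pooled covariance $\boldsymbol{\Sigma}$ misbehave when that matrix is ill-conditioned.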

My intuition (which someone with more math-fu than mine may be able to justify more or less rigorously) is that the classic analytic distributions are powerful precisely because of the strong analytic constraints that give rise to their "fine/local structure", and hence to all the profound, far-reaching theorems we have about them. It is this body of theory that makes these distributions "powerful". If this hunch is at all true, one would expect that analytic results derived from such distributions would tend to be very sensitive to numerical imperfections (outliers, collinearity, etc.) in the data.

At any rate, I interpreted the numerical problems I was running into as being possibly a merciful hint from the Gods of Statistics that I was using the wrong tool for the job.

This is what sent me off looking for a "distribution-free"/"nonparametric" alternative to $D_B$.


[1] The data consists of ~500 automatically collected features of individual cultured cells. All the features have positive values. I looked at histograms of several randomly chosen features, based on random subsamples of the data, and did not find a single one that looked normally distributed; those that were unimodal and bell-shaped all had a significant skew. A few features had such extreme outliers that the histograms showed only one or two bins tall enough to be distinguishable from the empty ones.


The cells were cultured from patient biopsies, divided into ~2500 subcultures, which were given one of ~800 different possible treatments, including a "no treatment" control. The treatments themselves fall into ~200 different groups. Therefore, imagine partitioning all the observations into ~200 subsamples, one for each of these ~200 treatment groups. At the moment I'm interested in measuring the distances between the (multivariate) sample distributions corresponding to each of these subsamples and the control (no treatment) subsample.

[2] More precisely, this inverse is required to compute $D_B$ using the formula I have for it. It is in the derivation of this particular formula, rather than in the definition of $D_B$ per se, that the normality assumption appears. I got the formula from Kailath's 1967 paper (Kailath, T. "The divergence and Bhattacharyya distance measures in signal selection." IEEE Transactions on Communication Technology 15.1 (1967): 52-60).

Best Answer

First of all, I advise you to take a look at the Encyclopedia of Distances by Michel and Elena Deza. From quickly browsing through the PDF (e.g. pp. 327-330), you can already see a multitude of possible statistical measures for multivariate populations. Although simple, one of these might be good enough to approximate the statistical divergence between your various populations. Additionally, there are many more 'simplistic' statistical distances you may want to consider; for example, Googling the term nonparametric multivariate distance turns up many distance measures.

A more involved first idea could be to preserve the treatment classification in the data and then estimate the distance between two possible hierarchies/classifications, since the problem concerns multivariate data that can be classified into different subgroups. One such measure is the Split-Order distance, described in the following paper:

Zhang et al. (2009), Split-Order Distance for Clustering and Classification Hierarchies, Lecture Notes in Computer Science, Vol. 5566, pp. 517-534.

This technique (and similar ones) tries to classify the data according to different possible hierarchies. I am not completely sure whether this is applicable to the subculture structure you mentioned, but it might be worth a look. Note, however, that this way of estimating a statistical distance relies heavily on algorithmic implementation (and therefore on computer science).

A more statistical way of looking at the problem would be to treat the data as categorized, using the treatment classification as the split between the different subpopulations, so that no specific hierarchy is assumed. The nonparametric measures that are useful here are based either on bootstrapping or on approximating the moments of the underlying distribution (the most famous such approach being the method of moments). Distance measures built this way often rely only on the mild assumption that the first and second moments are finite. A good example of such a measure can be found in the following paper:

Székely, G.J., Rizzo, M.L. (2004), Testing for equal distributions in high dimension, InterStat, 2004.
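Their statistic is built from interpoint Euclidean distances. As a rough illustration only (not the authors' implementation; it assumes NumPy and SciPy are available, and the function name is mine), the two-sample energy distance at the heart of the test looks like this:

```python
import numpy as np
from scipy.spatial.distance import cdist

def energy_distance(x, y):
    """Two-sample energy distance between samples x (n x d) and y (m x d).

    E = 2*mean ||x_i - y_j|| - mean ||x_i - x_k|| - mean ||y_j - y_l||
    It is nonnegative, zero only when the two distributions coincide,
    and requires only finite first moments -- no normality assumption.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    between = cdist(x, y).mean()     # average cross-sample distance
    within_x = cdist(x, x).mean()    # average within-sample distance in x
    within_y = cdist(y, y).mean()    # average within-sample distance in y
    return 2.0 * between - within_x - within_y
```

The paper scales this quantity by $nm/(n+m)$ to form the test statistic and, as far as I recall, obtains a null distribution by resampling the pooled sample; the R package `energy` by the same authors implements the full test.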

That paper tests the equality of two multivariate distributions in a fully nonparametric way. Another interesting nonparametric test, based on data depth, can be found in:

Chenouri, S., Farrar, T.J. (2012), A Two-Sample Nonparametric Multivariate Scale Test Based on Data Depth, Electronic Journal of Statistics, Vol. 6, pp. 760-782.

Now, apart from testing for a disparity between two samples, you might simply want to quantify and interpret the statistical difference between them. In that case, you might want to look into divergence measures such as the Bhattacharyya distance, or f-divergences such as the Hellinger distance.
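For what it's worth, the two are closely related: both are simple functions of the Bhattacharyya coefficient $BC(p, q) = \int \sqrt{p(x)\,q(x)}\,dx$,

$$D_B(p, q) = -\ln BC(p, q), \qquad H^2(p, q) = 1 - BC(p, q),$$

so estimating $BC$ directly from the data (e.g. from binned or kernel density estimates) sidesteps the Gaussian closed form, and with it the covariance inversion that caused your trouble.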

All these measures have different advantages and disadvantages and should be employed under the appropriate conditions. Be sure to always mind the scale of your variables, since variables measured on large scales will contribute disproportionately to any distance measure. If your variables are measured on different scales, standardize them before computing distances: for $n$ samples (groups), standardize each variable to zero mean and unit variance over the pooled data from all $n$ groups (a small sketch follows below). Good luck!
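To illustrate that standardization step, here is a minimal sketch (the array layout, the group labels, and the function names are purely illustrative assumptions on my part):

```python
import numpy as np

def standardize_pooled(X):
    """Scale each column (feature) to zero mean and unit variance,
    using moments computed over the pooled data from all groups."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)
    sigma[sigma == 0] = 1.0          # guard against constant features
    return (X - mu) / sigma

# X: (n_cells, n_features) array; groups: length-n_cells array of labels.
# Xz = standardize_pooled(X)
# control = Xz[groups == "control"]
# treated = Xz[groups == "treatment_A"]
# d = energy_distance(control, treated)   # e.g. the energy_distance sketch above
```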

P.S. Note that robust statistics often use a different penalisation function, such as the mean absolute deviation, instead of squaring the distance between each observation and the mean. This might help your search for a robust measure.
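Along those lines, a robust analogue of the standardization sketch above would center on the median and scale by the median absolute deviation (a close cousin of the mean absolute deviation just mentioned); again only a sketch, with the 1.4826 factor being the usual consistency constant for normal data:

```python
import numpy as np

def standardize_robust(X):
    """Center each column on its median and scale by the (scaled) median
    absolute deviation, which is far less sensitive to outliers than the
    mean and standard deviation."""
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    mad[mad == 0] = 1.0              # guard against constant features
    return (X - med) / mad
```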