Normality Assumption – Metric to Measure How ‘Standard Gaussian’ a Set of Samples Is for Hypothesis Testing

hypothesis testing, metric, multivariate normal distribution, normality-assumption, references

Assume that I have a set of $N$ samples $\mathbf{x}_1,\dots,\mathbf{x}_N\in\mathbb{R}^{D}$ from some otherwise unknown multivariate distribution $p$. I seek a metric that tells me how "close" $p$ is to the standard multivariate Gaussian distribution $\mathcal{N}\left(\mathbf{0},\mathbf{I}\right)$. I am particularly interested in metrics that can pick up on tricky forms of non-Gaussianity, such as those in the Datasaurus Dozen (below).

Can you recommend any metrics which might serve this purpose?

[Image: the Datasaurus Dozen]

Best Answer

One important question is whether you want a metric that says, for example, that a $\mathrm{Binomial}(n=20,\,p=0.5)$ distribution is close to a Normal (because its shape is nearly Gaussian) or far from a Normal (because it's discrete).

Another question is whether you want a metric for all distributions or just one for empirical distributions in datasets (so everything is discrete and has probability masses that are multiples of $1/n$).

Yet another question is how much you care about ease of computation, and in how many dimensions.

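For a metric on all distributions that says the Normal and Binomial are close, you want something that metrizes convergence in distribution. The Lévy-Prokhorov metric or a Wasserstein distance would be good (a Wasserstein distance additionally requires convergence of moments). The total variation and Hellinger distances would not work here: they put any discrete distribution at maximal distance from any continuous one.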
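One concrete distance of this kind is the Wasserstein-1 distance. The sketch below is a minimal one-dimensional illustration, not a recipe for the $D$-dimensional problem: it assumes two samples of equal size, in which case the distance between the empirical distributions reduces to the mean absolute difference of the sorted values. The function name `wasserstein_1d`, the sample size and the seed are all choices made for this example.

```python
import math
import random


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size one-dimensional samples.

    For equal sample sizes the optimal coupling matches order statistics,
    so the distance is the mean absolute difference of the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n = 2000

# Binomial(n=20, p=0.5) draws, simulated as sums of 20 fair coin flips.
binom = [sum(rng.random() < 0.5 for _ in range(20)) for _ in range(n)]
# Normal with the same mean (10) and variance (5) as the Binomial.
normal = [rng.gauss(10.0, math.sqrt(5.0)) for _ in range(n)]
# Normal with the same variance but a shifted mean, for comparison.
shifted = [rng.gauss(13.0, math.sqrt(5.0)) for _ in range(n)]

print("Binomial vs matching Normal:", wasserstein_1d(binom, normal))
print("Binomial vs shifted Normal: ", wasserstein_1d(binom, shifted))
```

The first number should come out small relative to the standard deviation ($\sqrt{5}\approx 2.24$), even though one sample is discrete and the other continuous, while the second should be close to the size of the shift.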

For a metric on all distributions that says the Normal and Binomial are not close, you want something based on the likelihood ratio. If it doesn't have to be literally a metric in the topological sense, the Kullback-Leibler divergence would do. If you only need symmetry, you could use the symmetrised Kullback-Leibler divergence, but note that it still fails the triangle inequality; for a true metric, the square root of the Jensen-Shannon divergence is an option.
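To make the likelihood-ratio behaviour concrete, here is a small sketch of the Kullback-Leibler divergence and its symmetrised version for two distributions on a shared finite support. The helper names `kl_divergence` and `symmetrised_kl` and the example probabilities are invented for illustration. The point to notice is that the divergence becomes infinite as soon as one distribution puts mass where the other has none, which is the mechanism that makes a discrete distribution "far" from a continuous one.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two probability vectors on the same finite support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf  # p has mass where q has none
        total += pi * math.log(pi / qi)
    return total


def symmetrised_kl(p, q):
    """KL(p || q) + KL(q || p): symmetric, but not a true metric."""
    return kl_divergence(p, q) + kl_divergence(q, p)


p = [0.5, 0.5, 0.0]
q = [0.25, 0.5, 0.25]

print("KL(p || q) =", kl_divergence(p, q))    # finite: q covers the support of p
print("KL(q || p) =", kl_divergence(q, p))    # infinite: q has mass where p has none
print("symmetrised =", symmetrised_kl(p, q))  # infinite as well
```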

For something that's easy to compute on data sets (but hard to work with mathematically), you could use a nearest-neighbour distance such as $\max_y\min_x d(x,y)$: the largest distance from a point $y$ in one data set to its nearest neighbour $x$ in the other. That's not symmetric, but you can add the same thing with $x$ and $y$ switched to make it symmetric. You could also replace the $\max_y$ by a mean over $y$, which is less sensitive to a single outlying point.
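A brute-force sketch of this nearest-neighbour distance might look as follows. It assumes Euclidean distance for $d$ and compares the data against a fresh sample drawn from the standard Gaussian; the function names, sample sizes and seed are illustrative choices, and the quadratic-time loop is only sensible for small data sets.

```python
import math
import random


def directed_nn_distance(ys, xs, reduce=max):
    """Reduce (max by default) over y in ys of the distance to its nearest x in xs."""
    nearest = [min(math.dist(y, x) for x in xs) for y in ys]
    return reduce(nearest)


def symmetric_nn_distance(a, b, reduce=max):
    """Add the two directed distances to obtain a symmetric quantity."""
    return directed_nn_distance(a, b, reduce) + directed_nn_distance(b, a, reduce)


def mean(values):
    return sum(values) / len(values)


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility

# Hypothetical 2-D data set: uniform on a square, so clearly non-Gaussian.
data = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(200)]
# Reference sample drawn from the standard bivariate Gaussian N(0, I).
reference = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

print("max version: ", symmetric_nn_distance(data, reference))
print("mean version:", symmetric_nn_distance(data, reference, reduce=mean))
```

Because the reference sample is itself random, the result varies from draw to draw; comparing against the value obtained for two independent Gaussian samples of the same size gives a rough baseline.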
