Solved – Test whether data set approximates normal distribution using mean and median

Tags: median, normal distribution

I understand that one check of whether a data set approximates a normal distribution is that the median and the mean should be approximately equal. My question is: how large a difference between the median and the mean is acceptable?

Best Answer

As I said in comments, you could work out (via simulation at the very least) the distribution of the difference between the sample mean and sample median, which would be symmetric about 0 and whose variance, multiplied by $n$, would asymptotically approach a constant. As such you could construct some kind of test for normality, but it would be a pretty poor one, since it would have fairly poor power against a host of symmetric alternatives that aren't normal -- nor indeed even against asymmetric alternatives that happen to have mean = median. If you're interested in assessing normality, there are certainly better ways.

To answer the question, though: this paper says that asymptotically the constant I mentioned is $\pi/2-1$ (that is, the variance of $\bar x - \tilde{x}$ in large samples is about $0.571\sigma^2/n$); in small samples it's a bit smaller. Since $\sqrt{\pi/2-1}\approx 0.756$, a rough rule of thumb is to expect the standard deviation of the difference between mean and median to be about $0.75\,\sigma/\sqrt{n}$ (for odd $n$; a bit smaller for even $n$).

Simulation of 10000 samples of size 25 gives a constant of $0.7390$ (that is, the s.d. of the difference was about $0.739\,\sigma/\sqrt{n}$), which is consistent with the results from the paper.
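
For concreteness, here's a minimal simulation sketch in Python (mine, not from the original answer) of that experiment: it draws repeated normal samples of size 25 and expresses the s.d. of mean minus median as a multiple of $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 25          # sample size (odd, as discussed above)
n_rep = 10_000  # number of simulated samples
sigma = 1.0     # true standard deviation of the normal population

# Draw n_rep normal samples of size n and record mean - median for each.
x = rng.normal(loc=0.0, scale=sigma, size=(n_rep, n))
diff = x.mean(axis=1) - np.median(x, axis=1)

# Express sd(mean - median) as a multiple of sigma/sqrt(n).
constant = diff.std(ddof=1) * np.sqrt(n) / sigma
print(constant)  # roughly 0.73-0.74 at n = 25, vs. 0.756 asymptotically
```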

This boils down to basically using Pearson's second skewness coefficient as a way of assessing normality (without the factor of 3 here, though - I agree with Nick Cox's comment below that it's more intuitive without it in any case). That quantity, $(\bar x - \tilde{x})/\sigma$, is sometimes called the nonparametric skew (though there's nothing that makes it any more nonparametric than any other skewness coefficient).

Now, considering it as a test statistic: since $\sigma$ will generally be unknown, it must usually be estimated; except in large samples (where we may apply Slutsky's theorem), this leads to a statistic that's heavier-tailed than normal - though not actually $t$-distributed, it will probably be close to it* - meaning the critical value will tend to be larger for smaller samples. This somewhat counteracts the effect above of the constant being smaller for smaller $n$, though not completely; an asymptotic 5% test rejects when $\frac{|\bar x - \tilde{x}|}{k\,s/\sqrt{n}}>1.96$, where $k = \sqrt{\pi/2-1}\approx 0.756$.
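
To make the rejection rule concrete, here's a hypothetical helper (the function name and defaults are mine, not part of the answer) that computes the statistic with the sample s.d. standing in for $\sigma$ and compares it to 1.96:

```python
import numpy as np

def mean_median_test(x, k=0.756, z_crit=1.96):
    """Reject normality (return True) when |xbar - xtilde| / (k * s / sqrt(n)) > z_crit.

    k = sqrt(pi/2 - 1) ~ 0.756 by default; the sample s.d. s stands in for the
    unknown sigma, so the nominal level is only approximate except in large samples.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = x.std(ddof=1)
    stat = abs(x.mean() - np.median(x)) / (k * s / np.sqrt(n))
    return stat > z_crit, stat

# Normal data should rarely reject; strongly skewed data usually will.
rng = np.random.default_rng(1)
print(mean_median_test(rng.normal(size=200)))
print(mean_median_test(rng.exponential(size=200)))
```

On heavily skewed data such as the exponential, the mean and median differ substantially, so the statistic is large; but as noted above, symmetric non-normal alternatives will mostly slip past it.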

Using 0.75 for $k$ is easy to remember and works quite well for odd $n$ down to about 25 or even a bit lower; the actual significance level at $n$ = 25 is close to 4.5% (on normal data, naturally). It's a reasonably easy test to remember even if it's not always useful.

* though we won't know a suitable approximate df without further effort
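
As a quick check of the roughly-4.5% figure quoted above for $n=25$ with $k=0.75$, here's a small simulation sketch (again mine, not the answer's) that counts how often normal samples are rejected at the nominal 5% level:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 25           # sample size
k = 0.75         # the easy-to-remember constant
n_rep = 100_000  # number of simulated normal samples

x = rng.normal(size=(n_rep, n))
s = x.std(axis=1, ddof=1)
stat = np.abs(x.mean(axis=1) - np.median(x, axis=1)) / (k * s / np.sqrt(n))

# Proportion of normal samples rejected at the nominal 5% level.
print((stat > 1.96).mean())
```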