I understand that, to test whether a data set approximates a normal distribution, the median and the mean should be approximately equal. So my question is: to what degree should a difference between the median and the mean be accepted?
Solved – Test whether data set approximates normal distribution using mean and median
Tags: median, normal distribution
Related Solutions
The observation that, in an example involving data drawn from a contaminated Gaussian distribution, you'd get better estimates of the parameters describing the bulk of the data by using the $\text{med}$ and the $\text{mad}$ instead of the mean and the standard deviation was originally made by Gauss (Walker, 1931). Here the $\text{mad}$ is:
$$\text{mad}=1.4826\times\text{med}|x-\text{med}(x)|$$
where $(\Phi^{-1}(0.75))^{-1}=1.4826$ is a consistency factor designed to ensure that $$\text{E}(\text{mad}(x)^2)=\text{Var}(x)$$ when $x$ is an uncontaminated Gaussian sample.
I cannot think of any reason not to use the $\text{med}$ instead of the sample mean in this case. The lower efficiency (at the Gaussian!) of the $\text{mad}$ can be a reason not to use the $\text{mad}$ in your example. However, there exist equally robust and highly efficient alternatives to the $\text{mad}$. One of them is the $Q_n$. This estimator has many other advantages besides: it is very insensitive to outliers (in fact nearly as insensitive as the mad); contrary to the mad, it is not built around an estimate of location and does not assume that the distribution of the uncontaminated part of the data is symmetric; like the mad, it is based on order statistics, so it is always well defined even when the underlying distribution of your sample has no moments; and, like the mad, it has a simple explicit form. Even more than for the mad, I see no reason to use the sample standard deviation instead of the $Q_n$ in the example you describe (see Rousseeuw and Croux, 1993, for more on the $Q_n$).
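A minimal numerical sketch of how these scale estimators behave on contaminated Gaussian data (Python with NumPy; this is a naive $O(n^2)$ version of the $Q_n$, without the small-sample correction factors of the original paper):

```python
import numpy as np

def mad(x):
    """Median absolute deviation, scaled by 1.4826 for consistency
    with the standard deviation at the Gaussian."""
    x = np.asarray(x)
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def qn(x):
    """Naive O(n^2) Rousseeuw-Croux Qn: the k-th order statistic of the
    pairwise distances |x_i - x_j| (i < j), scaled for Gaussian consistency."""
    x = np.sort(np.asarray(x))
    n = len(x)
    h = n // 2 + 1
    k = h * (h - 1) // 2                        # k = C(h, 2)
    diffs = np.abs(x[:, None] - x[None, :])[np.triu_indices(n, k=1)]
    return 2.2219 * np.sort(diffs)[k - 1]

rng = np.random.default_rng(0)
# 90% standard Gaussian, 10% gross contamination
x = np.concatenate([rng.normal(0, 1, 900), rng.normal(0, 10, 100)])
print(mad(x), qn(x), np.std(x, ddof=1))
# mad and Qn stay close to 1; the sample sd is badly inflated.
```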
As for your last question, about the specific case where $x\sim\Gamma(\nu,\lambda)$, then
$$\text{med}(x)\approx\lambda(\nu-1/3)$$
and
$$\text{mad}(x)\approx\lambda\sqrt{\nu}$$
(in both cases the approximations become good when $\nu>1.5$) so that
$$\hat{\nu}=\left(\frac{\text{med}(x)}{\text{mad}(x)}\right)^2$$
and
$$\hat{\lambda}=\frac{\text{mad}(x)^2}{\text{med}(x)}$$
See Chen and Rubin (1986) for a complete derivation.
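A quick simulation sketch of these median/mad-based estimators (Python with NumPy; the shape and scale values are arbitrary choices of mine, and the estimates are only rough since the approximations above are asymptotic in $\nu$):

```python
import numpy as np

rng = np.random.default_rng(1)
nu, lam = 10.0, 2.0                            # true shape and scale (invented)
x = rng.gamma(shape=nu, scale=lam, size=100_000)

med = np.median(x)
mad = 1.4826 * np.median(np.abs(x - med))      # Gaussian-consistent mad

nu_hat = (med / mad) ** 2                      # shape estimate
lam_hat = mad ** 2 / med                       # scale estimate
print(f"nu:  true {nu}, estimate {nu_hat:.2f}")
print(f"lam: true {lam}, estimate {lam_hat:.2f}")
# Expect rough agreement only: e.g. nu_hat tends to run a bit below
# the true shape, since med(x) is approximately lam * (nu - 1/3).
```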
- Chen, J. and Rubin, H. (1986). Bounds for the difference between median and mean of Gamma and Poisson distributions. Statistics & Probability Letters, 4, 281–283.
- Rousseeuw, P. J. and Croux, C. (1993). Alternatives to the Median Absolute Deviation. Journal of the American Statistical Association, 88(424), 1273–1283.
- Walker, H. (1931). Studies in the History of the Statistical Method. Baltimore, MD: Williams & Wilkins Co., pp. 24–25.
The two experiments are mostly unpaired, so the answer below addresses that situation. No information is given regarding outliers, defined here as values more than 5 IQR out. Regarding outliers: when you have nonsense solutions, they tend to be so extreme that identifying them is not especially challenging. However, formal tests for outliers can be performed (see link), which is clearly better than guessing whether you have problematic outliers.
The single most general answer to your question is to perform the Wilcoxon rank-sum (Mann-Whitney U) test. Note that this does not test for a difference in medians; its test statistic is the U-statistic, which is a better parameter than the median difference, and a test of the median difference has lower power (efficiency) for moderate to large sample sizes.
The unpaired, unequal-variance t-test can be more powerful than the Mann-Whitney U test when variances are unequal. However, the t-test's power is diminished under non-normal conditions. The dilemma of not knowing whether to use the equal- or unequal-variance assumption for t-testing is called the Behrens-Fisher problem. In general, one answer is to always use the unequal-variance test; another is to first test the significance of the difference in variances.
Knowing the above, I would usually do the following: calculate the Mann-Whitney U and unequal-variance unpaired t-tests for the different conditions (as sketched below), just to see which group of statistics is better at separating the various experimental conditions. Note that for non-normal data I would transform to normal conditions before applying the t-test, but (this you can check for yourself) it should make no difference for the U-statistics. For example, if for one test I find serial tests yielding p-values of 0.0001, 0.0002, 0.8000, 0.9000, I would tend to favor that test over another with respective p-values of 0.01, 0.02, 0.08, 0.09. I would also like my colleagues on this site to comment on the latter personal observation, as I have not heard it from anyone other than myself.
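A minimal sketch of the two tests side by side (Python with SciPy; the lognormal samples are an invented example), including a check that a monotone transform leaves the U-statistic unchanged:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# two hypothetical unpaired experimental conditions
a = rng.lognormal(mean=0.0, sigma=1.0, size=40)
b = rng.lognormal(mean=0.5, sigma=1.5, size=35)

# Wilcoxon rank-sum / Mann-Whitney U test (no normality assumption)
u, p_u = stats.mannwhitneyu(a, b, alternative="two-sided")

# Welch's unpaired, unequal-variance t-test
t, p_t = stats.ttest_ind(a, b, equal_var=False)

# A normalizing transform (log here) changes the t-test result...
t2, p_t2 = stats.ttest_ind(np.log(a), np.log(b), equal_var=False)
# ...but not the U test, which depends only on ranks.
u2, p_u2 = stats.mannwhitneyu(np.log(a), np.log(b), alternative="two-sided")

print(f"Mann-Whitney U: p = {p_u:.4f} (log data: p = {p_u2:.4f})")
print(f"Welch t-test:   p = {p_t:.4f} (log data: p = {p_t2:.4f})")
```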
Edit: The OP question has changed, but is still too broad.
I cannot fully answer the questions here; all I can do is give some general indications, as the full answer is book length.
1) One most frequently does not know whether it is better to use a mean value or a median value. For an empirical distribution, one could test using bootstrapping. Frequently that is not necessary or useful, as the approximate distribution type is usually sufficient to determine which is better. Moreover, the question is also largely irrelevant: unlike the mean, the median is only rarely MVUE. The alternative to the mean is not generally the median; it is, in the context provided, the U-statistic as above. Even in the rare cases in which the median provides a better measure of location (i.e., a lower-variance estimator) than the mean, the median may not be MVUE. For example, as per 4) below, for the Cauchy distribution the 25% trimmed mean is a better measure of location than the median.
To be fair, there are special situations in which the median value of a measure would be preferred. One example that I have had experience with is the following. Suppose we want to measure the concentration of a radioactive substance in blood plasma aliquots. The relevant questions are how many aliquots of each sample-time to take, and which statistic better measures concentration. Suppose we take two aliquots; then the mean and median are the same. Indeed, most people take only two aliquots. However, that is problematic. Although the counting statistics (Poisson) are approximately normal for the counts usually acquired during measurement (~10,000), so that the mean value would appear to be a good measure, the counting error is small (~1%) and is swamped by the pipetting error (up to 6% absolute error). Pipetting error is approximately Laplace or Cauchy distributed. One solution is then to take three samples and use only the middle, i.e., median, value as the estimate, because the mean value is not very stable. I do not claim that the median is the best possible measurement for even three samples, just that it is better than the mean in this particular case. What taking the median does here is make it less likely that the value chosen has a large error; it does not eliminate that error, it just reduces the likelihood of the error being large. Now comes the paradox that confuses most people: we did not eliminate the effect of the other two measurements by 'ignoring' them. Rather, we ranked the three measurements and chose the middle rank as generally a more stable representation of the true value than averaging in the wilder values would have allowed. When distributions are long-tailed, that is not uncommon.

Another example: suppose there are 10 people in a room who are colleagues, and we want to know what salary we can expect by becoming a colleague, based on their incomes. Income is notoriously long-tailed, and if we take an average we will inflate our expectation severely if one of those colleagues makes seven figures a year and the others make only five or six. In that case, the median would be a better expectation of our future salary. However, we might have been better off still taking the average of the logarithms of the salaries (as in average "figures") and then the antilog of that average (the geometric mean). Better yet, provided all the colleagues are employed and actually earning money, would be to take the reciprocal of the average reciprocal of earnings, called the harmonic mean, which would be the least inflated estimate for a Pareto distribution (von Hippel, P.T., Scarpino, S.V., Holas, I.: Robust estimation of inequality from binned incomes. Sociological Methodology 46(1), 212–251 (2016). https://arxiv.org/pdf/1402.4061); a comparison of these means is sketched below.
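A small sketch comparing these four estimates on a long-tailed sample (Python with NumPy/SciPy; the salary scale and Pareto shape are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# ten invented "salaries" from a classical Pareto with minimum 30,000
salaries = 30_000 * (1 + rng.pareto(a=1.5, size=10))

print(f"arithmetic mean: {np.mean(salaries):,.0f}")
print(f"median:          {np.median(salaries):,.0f}")
print(f"geometric mean:  {stats.gmean(salaries):,.0f}")   # antilog of mean log
print(f"harmonic mean:   {stats.hmean(salaries):,.0f}")   # 1 / mean(1/x)
# By the AM-GM-HM inequality the arithmetic mean is always the most
# inflated of the three means and the harmonic mean the least.
```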
2A) It is frequently not necessary to trim outliers; it is often enough to find a better, more physical data model. The most common phenomenon in biology is the use of empirical data-processing models that are inherently (I) unstable and (II) physically incorrect on two counts: (i) the units often do not balance and (ii) the assumptions are frequently unphysical. Ad (i): discard completely unbalanced measurement systems and rely only on their balanced equivalents. Personal experience of this includes an approximately twofold increase in precision/accuracy from ignoring body-surface-area "normalization" of glomerular filtration rate; a knowledge of proper body scaling is useful. Ad (ii): the remedy for unrealistic assumptions is to write physically more correct models, e.g., discard 'instant mixing' and replace it with 'slow mixing'. Note, however, as George Box said, all models are wrong but some are useful. Personal experience here is that by using more physical models I was able, for one biological problem, to reduce the occurrence of physically ridiculous results roughly 20-fold (from 4% to <0.2%), while increasing precision and accuracy about 2-fold.
2B) Sometimes it is necessary to account for outliers, but even then the accounting may be better done by transforming variables and/or changing statistical models than by trimming results that one does not 'like'. In order to trim an outlier, one should be prepared to state exactly what went wrong; that requires a lot more thinking than just using intuition. For example, it is frequently ignored that regression of a data-analysis model should ALWAYS converge. When that is enforced, some surprises can occur. From personal experience, the convergent answer can be a complex-number solution, which, when physically inappropriate, signals that the biological model used is literally unrealistic/unphysical. The remedy for that may be adaptively targeted regularization (or similar, if less optimal, techniques) and/or substitution of a more physical model.
3) Not quite the same, but there is some similarity. As above, trimmed calculations can be more stable than the mean or the median; see measures of location. The gist of the difference is that what you may think is an outlier may actually be an expected occurrence within the typical range of some distribution types, e.g., the biologically very common Student's-t distributions with low degrees of freedom.
4) Sometimes the median is more stable than the mean, and the mean can even be undefined, e.g., for the Cauchy distribution; see the link for measures of location above. That still does not justify the use of the median in other than relative terms; as above, it is generally not an ideal measurement.
Best Answer
As I said in comments, you could work out (via simulation at the very least) a distribution for the difference in sample mean and sample median, which would be symmetric about 0 and whose variance multiplied by n would asymptotically go to some constant. As such you could construct some kind of test for normality, but it would be a pretty poor test for it, since it would have fairly poor power against a host of symmetric alternatives that aren't normal -- nor indeed even against asymmetric alternatives that happen to have mean=median. If you're interested in assessing normality, there are certainly better ways.
To answer the question though, this paper says that asymptotically the constant I mentioned is $\pi/2-1$ (that is, the variance of $\bar x - \tilde{x}$ in large samples is about $0.571\sigma^2/n$). In small samples it's a bit smaller. As a rough rule of thumb, you can expect the standard deviation of the difference between mean and median to be about $0.75 \sigma/\sqrt{n}$ (in odd samples; a bit smaller for even $n$).
Simulation of 10000 samples of size 25 gives a constant of $0.739$ (that is, the s.d. of the difference was about $0.739\sigma/\sqrt{n}$), which is consistent with the results from the paper.
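A minimal replication sketch of that simulation (Python with NumPy; the seed and layout are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 25, 10_000
x = rng.normal(0.0, 1.0, size=(reps, n))       # sigma = 1

# difference between sample mean and sample median, per sample
diff = x.mean(axis=1) - np.median(x, axis=1)

# constant c in sd(mean - median) ~= c * sigma / sqrt(n)
c = diff.std(ddof=1) * np.sqrt(n)
print(f"estimated constant: {c:.3f}")          # close to 0.74 at n = 25
# asymptotic value: sqrt(pi/2 - 1) ~= 0.756
```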
This boils down to basically using Pearson's second skewness coefficient as a way of assessing normality (I haven't used the factor of 3 here, though - I agree with Nick Cox's comment below that it's more intuitive without it in any case). That's sometimes called the nonparametric skew (though there's nothing that makes it any more nonparametric than any other skewness coefficient).
Now, considering it as a test statistic: since $\sigma$ will generally be unknown, it must usually be estimated; except in large samples (when we may apply Slutsky's theorem), this will lead to a statistic that's heavier-tailed than normal (though not actually t-distributed, it will probably be close to it*), meaning a critical value will tend to be larger for smaller samples. This will somewhat counteract the above effect of the coefficient being smaller at smaller $n$, though not completely; an asymptotic 5% test rejects when $\frac{|\bar x - \tilde{x}|}{k\,s/\sqrt{n}}>1.96$, where $k = \sqrt{\pi/2-1}\approx 0.756$.
Using 0.75 for $k$ is easy to remember and works quite well for odd $n$ down to about 25 or even a bit lower; the actual significance level at $n$ = 25 is close to 4.5% (on normal data, naturally). It's a reasonably easy test to remember even if it's not always useful.
* though we won't know suitable approximate df without further effort
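For concreteness, a minimal sketch of the rejection rule above (Python with NumPy; the function name and example draws are my own illustration):

```python
import numpy as np

def mean_median_test(x, k=0.756):
    """Rough asymptotic 5% normality test from above: reject when
    |mean - median| / (k * s / sqrt(n)) > 1.96, with k = sqrt(pi/2 - 1);
    0.75 is an easy-to-remember substitute for moderate odd n."""
    x = np.asarray(x)
    z = abs(x.mean() - np.median(x)) / (k * x.std(ddof=1) / np.sqrt(len(x)))
    return z, z > 1.96

rng = np.random.default_rng(4)
print(mean_median_test(rng.normal(size=101)))       # typically not rejected
print(mean_median_test(rng.exponential(size=101)))  # typically rejected
```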