The two experiments are mostly unpaired, so this answer addresses that situation. No information is given about outliers defined as >5 IQR. Regarding outliers: nonsense solutions tend to be so extreme that identifying them is not especially challenging. Still, formal tests for outliers can be performed, see link, which is preferable to guessing whether you have problematic outliers.
The single most general answer to your question is to perform the Wilcoxon rank-sum (Mann-Whitney U) test. This does not test for a difference of medians; its test parameter is the U-statistic, which in this context is better than that. A test of median difference has lower power (efficiency) for moderate to large sample sizes.
The unpaired unequal-variance t-test can be more powerful than the Mann-Whitney U test when variances are unequal. However, t-test power is diminished under non-normal conditions. The dilemma of not knowing whether to use the equal- or unequal-variance assumption in t-testing is called the Behrens-Fisher problem. In general, one answer is to always use the unequal-variance test, and another is to test separately for a significant difference in variances.
Knowing the above, I would usually do the following: calculate both the Mann-Whitney U test and the unequal-variance unpaired t-test for the different conditions, just to see which group of statistics separates the various experimental conditions better. Note that for non-normal data I would transform to approximately normal conditions before t-testing, but (this you can check for yourself) transformation should make no difference for the U-statistic, which depends only on ranks. For example, if one test yields serial p-values of 0.0001, 0.0002, 0.8000, 0.9000, I would tend to favor that test over another test with respective p-values of 0.01, 0.02, 0.08, 0.09. I would also welcome comments from colleagues on this site about the latter personal observation, as I have not heard it from anyone other than myself.
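To make the side-by-side comparison above concrete, here is a minimal sketch using `scipy`. The group sizes, effect size, and random seed are illustrative assumptions, not values from the question:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical groups differing in both location and spread (unequal variances)
a = rng.normal(loc=0.0, scale=1.0, size=30)
b = rng.normal(loc=0.8, scale=2.0, size=30)

# Wilcoxon rank-sum / Mann-Whitney U: rank-based, unaffected by monotone transforms
u_stat, p_u = stats.mannwhitneyu(a, b, alternative="two-sided")

# Welch's unequal-variance unpaired t-test
t_stat, p_t = stats.ttest_ind(a, b, equal_var=False)

print(f"Mann-Whitney U p = {p_u:.4f}")
print(f"Welch t-test   p = {p_t:.4f}")
```

Running both across your experimental conditions, as described above, lets you see which statistic separates the conditions more sharply.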
Edit: The OP question has changed, but is still too broad.
I cannot fully answer the questions here; all I can do is give some general indications, as the full answer is book length.
1) One frequently does not know whether it is better to use a mean value or a median value. For an empirical distribution, one could test this by bootstrapping. Frequently that is not necessary or useful, as the approximate distribution type is usually sufficient to determine which is better. Moreover, the question is also largely irrelevant because, unlike the mean, the median is only rarely a minimum-variance unbiased estimator (MVUE). The alternative to the mean is not generally the median; in the context provided it is the U-statistic, as above. Even in the rare cases in which the median provides a better measure of location (i.e., a lower-variance estimator) than the mean, the median may still not be MVUE. For example, as per 4) below, for the Cauchy distribution the 25% trimmed mean is a better measure of location than the median.
To be fair, there are special situations in which the median value of a measure would be preferred. One example of this that I have had experience with is given here. Suppose we want to measure the concentration of a radioactive substance in blood plasma aliquots. The relevant questions are how many aliquots of each sample-time to take, and which statistic better measures concentration. Suppose we take two aliquots; then the mean and median are the same. Indeed, most people take only two aliquots. However, that is problematic. Although the counting statistics (Poisson) are approximately normal for the counts usually acquired during measurement (~10000), so the mean value would appear to be a good measure, the counting error is small (~1%) and is swamped by the pipetting error (up to 6% absolute error). Pipetting error is approximately Laplace or Cauchy distributed. One solution is then to take three samples and use only the middle, or median, value as the estimate, because the mean value is not very stable. I do not claim here that the median is the best possible measurement for even three samples, just that it is better than the mean in this particular case. What taking the median does in this case is make it less likely that the value chosen has a large error; it does not eliminate that error, it just reduces the likelihood of that error being large. Now comes the paradox that confuses most people. We did not eliminate the effect of the other two measurements by 'ignoring' them. Rather, we ranked three measurements and chose the most moderately ranked as being generally a more stable representation of the true value than averaging in the wilder values would have allowed. When distributions are long-tailed, that is not uncommon. Another example: suppose there are 10 people in a room who are colleagues, and we want to know what salary we can expect by becoming a colleague, based on their incomes.
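The median-of-three idea above can be simulated. The true concentration, Laplace error scale, and seed below are illustrative assumptions; the point is to compare how tightly the two estimators cluster around the true value under a heavy-tailed pipetting error:

```python
import numpy as np

rng = np.random.default_rng(1)
true_conc = 100.0
n_trials = 10_000

# Hypothetical pipetting error: Laplace-distributed (heavier-tailed than normal)
errors = rng.laplace(loc=0.0, scale=3.0, size=(n_trials, 3))
triplicates = true_conc + errors

mean_est = triplicates.mean(axis=1)        # average of all three aliquots
median_est = np.median(triplicates, axis=1)  # middle-ranked aliquot only

# Root-mean-square error of each estimator around the true concentration
rmse_mean = np.sqrt(np.mean((mean_est - true_conc) ** 2))
rmse_median = np.sqrt(np.mean((median_est - true_conc) ** 2))
print(rmse_mean, rmse_median)
```

For a Laplace error the advantage of the median at n=3 is modest; for still heavier tails (approaching Cauchy, as in the text) it grows considerably.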
Income is notoriously long-tailed, and if we take an average we will inflate our expectation severely if one of those colleagues makes seven figures a year and the others make only five or six. In that case, the median would be a better expectation of our future salary. However, we might have been better off still taking the average of the log salaries (as in averaging "figures") and then the antilog of that average (the geometric mean). Even better, provided all the colleagues are employed and actually earning money, would be to take the reciprocal of the average reciprocal of earnings, called the harmonic mean, which would be the least inflated estimate for a Pareto distribution (von Hippel, P.T., Scarpino, S.V., Holas, I.: Robust estimation of inequality from binned incomes. Sociological Methodology 46(1), 212--251 (2016). https://arxiv.org/pdf/1402.4061)
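The four location estimates discussed above can be compared directly on a made-up salary list (the figures below are invented for illustration, with one seven-figure earner):

```python
import numpy as np
from scipy import stats

# Hypothetical salaries: nine five-to-six-figure earners and one seven-figure outlier
salaries = np.array([55_000, 60_000, 62_000, 70_000, 75_000,
                     80_000, 90_000, 110_000, 150_000, 2_000_000])

arithmetic = salaries.mean()
median = np.median(salaries)
geometric = stats.gmean(salaries)   # antilog of the mean log salary
harmonic = stats.hmean(salaries)    # reciprocal of the mean reciprocal

print(arithmetic, median, geometric, harmonic)
```

By the AM-GM-HM inequality the harmonic mean is the smallest and the arithmetic mean the largest, so the harmonic mean is indeed the least inflated by the outlier.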
2A) It is frequently not necessary to trim outliers; it is better to find a more physical data model. The most common phenomenon in biology is the use of empirical data-processing models that are inherently I) unstable and II) physically incorrect on two counts: i) the units often do not balance, and ii) the assumptions are frequently unphysical. Ad i) Discard completely unbalanced measurement systems and rely only on their balanced equivalents. Personal experience with this includes an approximately two-fold increase in precision/accuracy from ignoring body surface area "normalization" of glomerular filtration rate. A knowledge of proper body scaling is useful. Ad ii) The remedy for unrealistic assumptions is to write physically more correct models, e.g., discard 'instant mixing' and replace it with 'slow mixing'. Note, however, as George Box said, all models are wrong but some are useful. Personal experience with this is that by using more physical models, I was able, for one biological problem, to reduce the occurrence of physically ridiculous results by 20-fold, from 4% to <0.2%, while increasing precision and accuracy roughly two-fold.
2B) Sometimes it is necessary to account for outliers, but even then the accounting may be better done by transforming variables and/or changing statistical models than by trimming results that one does not 'like'. In order to trim an outlier result, one should be prepared to state exactly what went wrong. That requires a lot more thinking than just using intuition. For example, it is frequently ignored that regression of a data-analysis model should ALWAYS converge. When convergence is enforced, some surprises can occur. From personal experience, the convergent answer can be a complex-number solution, which, when physically inappropriate, signals that the biological model used is literally unrealistic/unphysical. The remedy for that may be to use adaptively targeted regularization (or similar, if less optimal, techniques) and/or substitution of a more physical model.
3) Not quite the same, but there is some similarity. As above, trimmed estimates can be more stable than the mean or median; see measures of location. The gist of this difference is that what you may think is an outlier may actually be an expected occurrence within the typical range of some distribution types, e.g., the biologically very common Student's t-distributions with low degrees of freedom.
4) Sometimes the median is more stable than the mean, and the mean can even be undefined, e.g., for the Cauchy distribution; see the link for measures of location above. That still does not justify using the median in other than relative terms; as above, it is generally not an ideal measurement.
If the original statement doesn't limit the conditions under which it applies pretty substantially, Field is just wrong on this.
Responding to the quoted section:
"In effect, this means it does much the same as the Mann–Whitney test!"
No, it really doesn't. They really test for different kinds of things. As one example, if two close-to-symmetric distributions differ in spread but don't differ in location, the Kolmogorov-Smirnov can identify that kind of difference (in large enough samples relative to the effect) but the Wilcoxon-Mann-Whitney can't.
This is because they're designed for different purposes.
"However, this test tends to have better power than the Mann–Whitney test when sample sizes are less than about 25 per group, and so is worth selecting if that’s the case."
As a general claim, this is nonsense. Against the things the Mann-Whitney doesn't test it has better power, but against the things the Mann-Whitney is meant for, it doesn't. This doesn't change when $n<25$.
[There may be some situation where the claim is true; if Field doesn't explain what context his claim applies in, I'm not likely to be able to guess it.]
Here's a power curve for n=20 per group. The significance level is a bit over 3% for each test (in fact the achievable significance level for the KS is slightly higher and I have not attempted to use a randomized test to adjust for that difference so it's been given a small advantage in this comparison):
As we see, in this case (the first one I tried) the Wilcoxon-Mann-Whitney is clearly more powerful.
At n=5, the Kolmogorov-Smirnov remains less powerful for this situation. [So what the heck is he talking about? Is he comparing power for some situation not mentioned in the quote? I don't know, but going just on what's quoted here we should not take that claim at face value. It was wrong in the first thing I checked, and - based on a broader familiarity with the two tests, I would readily bet it's wrong for a bunch of other situations.]
At sample sizes of 4 and 11 per group, for shift alternatives (and normal populations), again the Wilcoxon-Mann-Whitney does better.
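A power comparison like the ones above is easy to reproduce by simulation. The sketch below assumes normal populations with a location shift of 0.8 SD and n=20 per group (my choices, not necessarily the figure's exact settings), and simply counts rejections at a nominal 5% level for each test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, shift, n_sims, alpha = 20, 0.8, 2000, 0.05

rej_wmw = rej_ks = 0
for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, n)
    y = rng.normal(shift, 1.0, n)  # pure location-shift alternative
    if stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha:
        rej_wmw += 1
    if stats.ks_2samp(x, y).pvalue < alpha:
        rej_ks += 1

power_wmw = rej_wmw / n_sims
power_ks = rej_ks / n_sims
print(power_wmw, power_ks)
```

Note this uses nominal alpha = 0.05 for both tests rather than matching attainable levels, so, as discussed below, the gappiness of the KS levels handicaps it further; matching levels (as in the figure) is the fairer comparison.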
With the variable you're looking at, a suitable alternative is probably something more like a scale shift; but if some transformation of your data (say a square root or a cube root, or better still a log) doesn't look too non-normal, the results I mention should be relevant. If you have discrete or zero-inflated data, that may make some difference, but my bet would be that the Kolmogorov-Smirnov doesn't overtake the Wilcoxon-Mann-Whitney then either. [I won't pursue this at present because it's not clear whether it's relevant for your situation.]
In addition, the attainable significance levels of the Kolmogorov-Smirnov are very gappy at small sample sizes: you often can't get tests close to the usual significance levels you are likely to want. (The WMW does much better than the KS in relation to available test sizes. There is a neat way to dramatically improve this gappiness of levels without losing either the nonparametric or the rank-based nature of tests like these, and without randomized testing, but it seems to be very rarely used for some reason.)
Note that I carefully chose examples that made the levels of the two tests close to comparable. If you're just choosing $\alpha=0.05$ every time without regard to the available levels and comparing a p-value to that, then the gappiness of the Kolmogorov-Smirnov's attainable levels is going to make its power much worse in general (though it will very occasionally help it a little, as here; this advantage will not generally be much, and probably not enough to help it beat the WMW at the task it's suited for).
If you're in a situation where the Wilcoxon-Mann-Whitney tests what you want to test, I would definitely not recommend using the Kolmogorov-Smirnov instead. I'd use each test for what they're designed to test, which is where they tend to do fairly well.
The best way to figure out what's best is to try some simulations in situations that would be realistic for the kind of data you will have. Then you can see which test does what, and when.
Also when reporting the intakes along with the p values, should I use mean and standard deviation or median and IQR, as the data is non-parametric?
Data are just data. They're neither parametric nor nonparametric -- that's a property of the models and inferential procedures (estimation, testing, intervals) we apply to them. Parametric means "defined up to a fixed, finite number of parameters", which is an attribute of models, not of data. If you can't just give both sets of values (which would be my preference) and must instead choose one or the other, which is more relevant scientifically or in relation to your question of interest?
[Note that the Wilcoxon-Mann-Whitney doesn't compare either means or medians (unless you add some assumptions I bet don't come close to applying in this case). Nor does the Kolmogorov-Smirnov.]
Also when reporting the intakes along with the p values, should I use mean and standard deviation or median and IQR
My general advice is to report what makes sense to report for that variable (without worrying very much about what its distribution might be); if you want to know something about the population mean, the sample mean generally makes sense to report, similarly for the population median. Personally I rarely look at only one summary statistic and when reading a paper, I am interested in more than one.
Neither sample means nor sample medians will correspond to what either of the tests here are comparing.
Framing the question
You are asking an applied and subjective question, and thus, any answer needs to be infused with applied and subjective considerations.
From a purely statistical perspective, the mean and median both provide different information about the central tendency of a sample of data. Thus, neither is correct or incorrect by definition.
From an applied perspective, we often want to say something meaningful about the central tendency of a sample, where central tendency maps onto some subjective notion of "typical".
General thoughts
When summarising what is typical in a sample, observations that are many standard deviations away from the mean (perhaps 3 or 4 SD) will have a large influence on the mean, but not the median. Such observations may pull the mean away from what we think of as the "typical" value of the sample. This helps to explain the popularity of the median for reporting house prices and income, where a single private island in the Pacific, or a single billionaire, could dramatically influence the mean but not the median. Such distributions are often positively skewed and can include extreme outliers; the median, in contrast, is robust to them.
The median can be problematic when the data take on a limited number of values. For example, the median of a 5-point Likert item lacks the nuance possessed by the mean: samples with means of 2.8, 3.0, and 3.3 might all have a median of 3.
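A quick numerical illustration of that point, with three made-up 5-point Likert samples constructed so their means differ while their medians coincide:

```python
import numpy as np

# Three hypothetical Likert samples (values 1-5), n = 10 each
a = np.array([2, 2, 3, 3, 3, 3, 3, 3, 3, 3])  # mean 2.8
b = np.array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])  # mean 3.0
c = np.array([2, 3, 3, 3, 3, 3, 3, 4, 4, 5])  # mean 3.3

for s in (a, b, c):
    print(s.mean(), np.median(s))  # means differ, medians are all 3
```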
In general, the mean has the benefit of using more of the information from the data.
When a distribution is skewed, it is also possible to transform it and report the mean of the transformed distribution.
When a distribution includes outliers, it is possible to use a trimmed mean, or remove the outliers, or adjust the value of the outlier to a less extreme value (e.g., 2 SD from the mean).
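The trimming and adjustment options above can be sketched briefly. The data below are a hypothetical normal sample contaminated with five extreme high values; `scipy.stats.trim_mean` handles the trimming, and the 2-SD adjustment (winsorizing to the mean ± 2 SD) is done with a clip:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# 95 well-behaved observations plus 5 extreme outliers around 40
x = np.concatenate([rng.normal(10, 1, 95), rng.normal(10, 1, 5) + 30])

# Trimmed mean: discard the top and bottom 10% before averaging
trimmed = stats.trim_mean(x, proportiontocut=0.1)

# Adjust outliers to a less extreme value: clip at mean +/- 2 SD, then average
mu, sd = x.mean(), x.std()
winsorized = np.clip(x, mu - 2 * sd, mu + 2 * sd).mean()

print(x.mean(), trimmed, winsorized)
```

Both robust summaries land much nearer the bulk of the data (around 10) than the raw mean, which the outliers drag upward.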