Solved – Comparison of distribution mean or median

mean, median, noise, t-test

I am working with very noisy biological data for which I will compare two experimental settings. For each setting I will get a set of measurements with a huge variance, sometimes with a skewed distribution, but sometimes not (after log transformation).
If I want to statistically compare my two experimental settings, should I compare the means with the appropriate statistical test (e.g. t-test), or should I compare the medians (typically with bootstrapping, although I know some statistical tests exist)?
I am well aware of why the median is a better description of the center of a skewed distribution; it also better describes a distribution where you do not have good estimates of extreme values. I have read several other threads about "mean vs median" (e.g. Is median fairer than mean?).

However, my question is specific to comparing two noisy datasets. In biology, for instance, you can assume that some extreme values are incorrect measurements, hence the median could be more robust. In general, the data are unpaired, and it is actually beyond our current understanding of the process whether outliers are meaningful or not (the measurements are very noisy, but the data itself is also very variable). However, comparing the medians seems a very naive approach, specifically in the case where you could get a normal distribution. What is the best approach for such comparisons? Are there any scientific papers I can refer to?

Edit: Following comments, I will try to make this question less vague.

Suppose you have to compare two distributions of unpaired, noisy data for which you cannot tell whether outliers are meaningful or not. This happens a lot in biology, where some processes are largely unknown and very variable between individuals. Let's say that after a log transformation your points are roughly normally distributed.

1) How do you go about deciding which is the best test to compare the two experimental setups between mean or median?

2) What is the best approach for handling values that may be true outliers but could also be biological noise? Discarding them typically means increasing the power to see a difference between the two distributions, but isn't that some form of p-hacking?

3) Isn't using the median instead of the mean the same as discarding the outliers?

4) Putting 2 and 3 together, how can you justify using the median for such comparisons as not being p-hacking?

Best Answer

The two experiments are mostly unpaired, so the answer applies to that situation. The information regarding outliers defined as >5 IQR is not given. Regarding outliers, when you have nonsense solutions, they tend to be so crazy that identifying them is not especially challenging. However, formal tests for outliers can be performed (see link), which is obviously better than guessing whether you have problematic outliers.

The single most general answer to your question is to perform the Wilcoxon rank-sum (Mann-Whitney U) test. This does not test for a difference in medians; its test statistic is the U-statistic, which is a better choice. A test of median difference has lower power (efficiency) for moderate to large sample sizes.
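As a rough illustration of what the U-statistic measures, here is a minimal standard-library sketch. The function name `mann_whitney_u` and the sample values are made up for this example, and the p-value uses the large-sample normal approximation without tie or continuity corrections, so it is only approximate for small samples. In practice one would more likely call a library routine such as `scipy.stats.mannwhitneyu`.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U, p) for two unpaired samples.

    U counts the (x_i, y_j) pairs with x_i > y_j, ties counting one half.
    The two-sided p-value uses the normal approximation with no tie or
    continuity correction, so treat it as rough for small samples.
    """
    n1, n2 = len(x), len(y)
    u = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
            for xi in x for yj in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p


# Hypothetical log-scale measurements, each group with one wild value.
setting_a = [2.1, 2.4, 1.9, 2.8, 2.2, 2.6, 5.9]
setting_b = [3.0, 3.4, 2.9, 3.8, 3.1, 3.6, 0.4]

u, p = mann_whitney_u(setting_a, setting_b)
share = u / (len(setting_a) * len(setting_b))
print(f"U = {u}, estimated P(A > B) = {share:.2f}, approximate p = {p:.3f}")
```

Dividing U by the number of pairs gives an estimate of the probability that a value from one setting exceeds a value from the other, which is why the test is about more than the medians.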

The unpaired unequal variance t-test can be more powerful than the Mann-Whitney U test for unequal variances. However, t-test power is diminished by non-normal conditions. The dilemma of not knowing whether to use the equal or unequal variance assumption for t-testing is called the Behrens-Fisher problem. In general, one answer is to always use the unequal variance test, and the other is to first test separately whether the variances differ significantly.
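For reference, the unequal variance (Welch) statistic and its Welch-Satterthwaite degrees of freedom can be sketched as below. The function name `welch_t` and the data are illustrative only. The standard library has no Student's t CDF, so this sketch stops short of a p-value; a routine such as `scipy.stats.ttest_ind(x, y, equal_var=False)` would normally supply it.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for the unpaired, unequal variance t-test."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Hypothetical log-transformed measurements with visibly different spreads.
control = [2.0, 2.2, 1.9, 2.1, 2.3, 2.0]
treated = [2.9, 1.8, 3.6, 2.4, 4.1, 3.0]

t, df = welch_t(control, treated)
print(f"t = {t:.2f} on about {df:.1f} degrees of freedom")
```

When the two groups have equal size and equal sample variance, the degrees of freedom reduce to the pooled value n1 + n2 - 2; otherwise they are smaller.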

Knowing the above, I would usually do the following: calculate the Mann-Whitney U and unequal variance unpaired t-tests for the different conditions, just to see which group of statistics functions better at separating the various experimental conditions. Note that for non-normal data I would transform to normal conditions before applying t-testing, but (this you can check for yourself) it should make no difference for the U-statistics. For example, if for one test I find serial tests to yield p-values of 0.0001, 0.0002, 0.8000, 0.9000, I would tend to favor that test over another test with respective p-values of 0.01, 0.02, 0.08, 0.09. I would also like my colleagues on this site to comment on the latter personal observation, as I have not heard tell of it from anyone other than myself.
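The remark that a transformation makes no difference to the U-statistic can be checked with a few lines. This is a sketch with invented data: the log is strictly increasing, so it preserves every pairwise ordering and therefore U, while the Welch t statistic depends on the actual values and does change.

```python
import math
from statistics import mean, variance


def u_statistic(x, y):
    """Count pairs with x_i > y_j, ties counting one half."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)


def welch_t_statistic(x, y):
    """Unequal variance t statistic (no p-value)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


# Hypothetical raw (skewed) measurements and their log transforms.
a = [1.2, 3.4, 0.8, 15.0, 2.2, 4.1]
b = [2.9, 7.5, 40.0, 5.1, 3.3, 9.6]
log_a = [math.log(v) for v in a]
log_b = [math.log(v) for v in b]

print("U raw:", u_statistic(a, b), " U log:", u_statistic(log_a, log_b))
print("t raw:", round(welch_t_statistic(a, b), 2),
      " t log:", round(welch_t_statistic(log_a, log_b), 2))
```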

Edit: The OP question has changed, but is still too broad.

I cannot fully answer the questions here; all I can do is give some general indications, as the full answer is book length.

1) Most frequently, one does not know whether it is better to use a mean value or a median value. For an empirical distribution, one could test this using bootstrapping. Frequently, that is not necessary or useful, as the approximate distribution type is usually sufficient to determine which is better. Moreover, the question is also largely irrelevant because, unlike the mean, the median is at most rarely the MVUE (minimum-variance unbiased estimator). The alternative to the mean is not generally the median; in the context provided, it is the U-statistic as above. Even in the rare cases in which the median provides a better measure of location (i.e., a lower variance estimator) than the mean, the median may not be the MVUE. For example, as per "4)" below, for the Cauchy distribution a heavily trimmed mean (one keeping roughly the middle quarter of the ordered sample) is a more efficient measure of location than the median.
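The bootstrap check mentioned in 1) could look roughly like this: resample the observed data with replacement many times and compare how much the mean and the median vary across resamples. The helper name `bootstrap_se`, the number of resamples, and the data (ten ordinary readings plus one extreme value) are all assumptions made for illustration.

```python
import random
from statistics import mean, median, stdev


def bootstrap_se(sample, statistic, reps=2000, seed=0):
    """Estimate the standard error of `statistic` by resampling `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = [statistic(rng.choices(sample, k=n)) for _ in range(reps)]
    return stdev(estimates)


# Hypothetical measurements with a single extreme value.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 10.1, 9.6, 55.0]

se_mean = bootstrap_se(sample, mean)
se_median = bootstrap_se(sample, median)
print(f"bootstrap SE of mean:   {se_mean:.2f}")
print(f"bootstrap SE of median: {se_median:.2f}")
```

Whichever estimator shows the smaller bootstrap spread is the more stable one for data like these; on well-behaved, roughly normal data the comparison would be expected to favor the mean instead.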

To be fair, there are special situations in which the median value of a measure would be the preferred measurement. One example that I have had experience with is given here. Suppose we want to measure the concentration of a radioactive substance in blood plasma aliquots. The relevant question is how many aliquots to take at each sample time, and what statistic to use to best measure concentration. Suppose we take two aliquots. Then the mean and median are the same. Indeed, most people take only two aliquots. However, that is problematic. Although the counting statistics (Poisson) are approximately normal for the counts usually acquired during measurement (~10000), and thus the mean value would appear to be a good measure, the counting error is small (~1%) and is swamped by the pipetting error (up to 6% absolute error). Pipetting error is approximately Laplace or Cauchy distributed. One solution is then to take three samples and use only the middle, or median, value as the estimate, because the mean value is not very stable. I do not claim here that the median value is the best possible measurement for even three samples, just that it is better than the mean value in this particular case. What taking the median does in this case is make it less likely that the value chosen has a large error; it does not eliminate that error, it just reduces the likelihood of that error being large.

Now comes the paradox that confuses most people. We did not eliminate the effect of the other two measurements by 'ignoring' them. Rather, we ranked three measurements and chose the middle-ranked one as being generally a more stable representation of the true value than averaging in the wilder values would have allowed. When distributions are long tailed, that is not uncommon.

Another example: suppose there are 10 people in a room who are colleagues, and we want to know what salary we can expect on becoming a colleague, based on their incomes. Income is notoriously long-tailed, and if we take an average we will inflate our expectation severely if one of those colleagues makes seven figures a year and the others make only five or six. In that case, the median would be a better expectation of our future salary. However, we might have been better off still taking the average of the log of the salaries (as in average "figures") and then the antilog of that average (the geometric mean). Even better, provided all the colleagues are employed and actually earning money, would be to take the reciprocal of the average reciprocal of earnings, called the harmonic mean, which would be the least inflated estimate for a Pareto distribution (von Hippel, P.T., Scarpino, S.V., Holas, I.: Robust estimation of inequality from binned incomes. Sociological Methodology 46(1), 212--251 (2016). https://arxiv.org/pdf/1402.4061)
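The salary example can be made concrete with Python's `statistics` module. The ten salaries below are invented purely for illustration (nine five- or six-figure incomes and one seven-figure income), and `location_summaries` is a hypothetical helper name.

```python
from statistics import mean, median, geometric_mean, harmonic_mean


def location_summaries(values):
    """Return several measures of location for strictly positive values."""
    return {
        "arithmetic mean": mean(values),
        "median": median(values),
        "geometric mean": geometric_mean(values),  # antilog of the mean log
        "harmonic mean": harmonic_mean(values),    # reciprocal of mean reciprocal
    }


# Hypothetical annual salaries of ten colleagues.
salaries = [48_000, 52_000, 55_000, 61_000, 67_000,
            72_000, 80_000, 95_000, 140_000, 2_500_000]

for name, value in location_summaries(salaries).items():
    print(f"{name:>16}: {value:12,.0f}")
```

For positive data the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean, so the single seven-figure salary inflates the three summaries in that order of severity.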

2A) It is frequently not necessary to trim outliers; just find a better, more physical data model. The most common phenomenon in biology is to use empirical data processing models that are inherently I) unstable and II) physically incorrect on two counts: i) the units often do not balance, and ii) the assumptions are frequently unphysical. Ad i) Discard completely unbalanced measurement systems, and rely only on their balanced equivalents. Personal experience on that includes an approximately twofold increase in precision/accuracy by ignoring body surface area "normalization" of glomerular filtration rate. A knowledge of proper body scaling is useful. Ad ii) The remedy for unrealistic assumptions is to write physically more correct models, e.g., discard 'instant mixing' and replace it with 'slow mixing'. Note, however, as George Box said, all models are wrong but some are useful. Personal experience on that is that by using more physical models, I was able for one biological problem to reduce the occurrence of physically ridiculous results 20-fold, from 4% to <0.2%, while increasing precision and accuracy ~2-fold.

2B) Sometimes it is necessary to account for outliers, but even then the accounting may be better done by transforming variables and/or changing statistical models than by trimming results that one does not 'like'. In order to trim an outlier result, one should be prepared to state exactly what went wrong. That requires a lot more thinking than just using intuition. For example, it is frequently ignored that regression of a data analysis model should ALWAYS converge. When that is enforced, some surprises can occur. From personal experience, the convergent answer can be a complex number solution, which, when physically inappropriate, signals that the biological model used is literally unrealistic/unphysical. The remedy for that may be to use adaptively targeted regularization (or similar, if less optimal, techniques) and/or substitution of a more physical model.

3) Not quite the same, but there is some similarity. As above, trimmed calculations can be more stable than the mean or median; see measures of location. The gist of this difference is that what you may think is an outlier may actually be an expected occurrence for the typical range of some distribution types, e.g., the biologically very common Student's-t distributions with low degrees of freedom.
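To illustrate point 3), the sketch below draws from a Student's-t distribution with 3 degrees of freedom and from a normal distribution, then counts how many points a common rule of thumb would flag as outliers. The 1.5 IQR fence is an assumption chosen for this example (it is not the >5 IQR criterion mentioned earlier), and the helper names are made up.

```python
import math
import random
from statistics import quantiles


def student_t(rng, df):
    """Draw one Student's-t variate as normal / sqrt(chi-square / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2, 2.0)  # chi-square with df degrees of freedom
    return z / math.sqrt(chi2 / df)


def flagged_fraction(values, k=1.5):
    """Fraction of values beyond k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(v < low or v > high for v in values) / len(values)


rng = random.Random(42)
normal_draws = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
t3_draws = [student_t(rng, 3) for _ in range(20_000)]

print(f"flagged, normal: {flagged_fraction(normal_draws):.3%}")
print(f"flagged, t(3):   {flagged_fraction(t3_draws):.3%}")
```

Every flagged t(3) point here is a legitimate draw from the distribution, so trimming such points would discard expected data rather than errors.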

4) Sometimes the median is more stable than the mean, and the mean can even be undefined, e.g., for the Cauchy distribution; see the link for measures of location above. That still does not justify the use of the median in other than relative terms; as above, it is generally not an ideal measurement.
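A small simulation, under assumed settings, shows what "undefined mean" means in practice for the Cauchy distribution: the sample mean does not become more precise as the sample grows, while the sample median does. Spread is measured by the interquartile range of the estimates across repeated samples, because their variance is not a usable yardstick here. The function names, sample sizes, and repetition count are arbitrary choices.

```python
import math
import random
from statistics import mean, median, quantiles


def cauchy(rng):
    """Draw one standard Cauchy variate by inverting its CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def estimator_iqr(estimator, n, reps=500, seed=0):
    """IQR of `estimator` over `reps` Cauchy samples of size `n`."""
    rng = random.Random(seed)
    estimates = [estimator([cauchy(rng) for _ in range(n)])
                 for _ in range(reps)]
    q1, _, q3 = quantiles(estimates, n=4)
    return q3 - q1


for n in (25, 400):
    print(f"n = {n:3d}: IQR of mean = {estimator_iqr(mean, n):.2f}, "
          f"IQR of median = {estimator_iqr(median, n):.2f}")
```

The spread of the sample mean stays about the same at both sample sizes, whereas the spread of the median shrinks roughly with the square root of n.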