Mean – How to Replace Outliers with Mean for Robust Data Analysis

Tags: mean, outliers, robust, winsorizing

This question was asked by a friend of mine who is not internet savvy. I have no statistics background, and I have been searching around the internet for an answer to this question.

The question is: is it possible to replace outliers with the mean value? If it is possible, is there any book or journal reference to back up this approach?

Best Answer

Clearly it's possible, but it's not clear that it could ever be a good idea.

Let's spell out several ways in which this is a limited or deficient solution:

  • In effect you are saying that the outlier value is completely untrustworthy, to the extent that your only possible guess is that the value should be the mean. If that's what you think, it is likely to be more honest just to omit the observation in question, as evidently you don't have enough information to make a better guess.

  • With nothing else said, you need a criterion or criteria for identifying outliers in the first place (as implied by @Frank Harrell). Otherwise this is an arbitrary and subjective procedure, even if it is defended as a matter of judgment. With some criteria, it is possible that removing outliers in this way creates yet more outliers as a side-effect. An example could be that outliers are more than so many standard deviations away from the mean. Removing an outlier changes the standard deviation, and new data points may now qualify, and so on (see the code sketch after this list).

  • Presumably the mean here means the mean of all the other values, a point made explicit by @David Marx. The idea is ambiguous without this stipulation.

  • Using the mean may seem a safe or conservative procedure, but changing a value to the mean will change almost every other statistic, including measures of level, scale and shape and indicators of their uncertainty, a point emphasized by @whuber (and also illustrated in the sketch after this list).

  • The mean may not even be a feasible value: simple examples are when values are integers, but typically the mean isn't an integer.

  • Even with the idea that using a summary measure is a cautious thing to do, using the mean rather than the median or any other measure needs some justification.

  • Whenever there are other variables, modifying the value of one variable without reference to others may make a data point anomalous in other senses.
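
Two of the bullets above are easy to see numerically: replacing an outlier with the mean of the other values shifts almost every summary statistic, and a "more than k standard deviations from the mean" rule, applied repeatedly, can keep flagging new points. Below is a minimal NumPy sketch on simulated data; the sample (50 roughly normal values plus one gross outlier) and the cut-off k = 2.5 are illustrative choices, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=10.0, scale=2.0, size=50), 40.0)  # one gross outlier

def summary(v):
    return {"mean": round(v.mean(), 2),
            "sd": round(v.std(ddof=1), 2),
            "median": round(float(np.median(v)), 2)}

print("original:     ", summary(x))

# (1) Replace the most extreme value with the mean of the *other* values;
#     note how the mean, sd and other summaries all move.
i = np.argmax(np.abs(x - np.median(x)))
x_repl = x.copy()
x_repl[i] = np.delete(x, i).mean()
print("mean-replaced:", summary(x_repl))

# (2) Apply a "more than k standard deviations from the mean" rule repeatedly:
#     removing (or mean-replacing) a flagged point shrinks the sd, so points
#     that previously passed can be flagged on the next pass.
def iterative_sd_rule(v, k=2.5, max_iter=10):
    v = v.copy()
    for step in range(1, max_iter + 1):
        z = np.abs(v - v.mean()) / v.std(ddof=1)
        flagged = z > k
        if not flagged.any():
            break
        print(f"pass {step}: flagged {flagged.sum()} point(s)")
        v = v[~flagged]
    return v

_ = iterative_sd_rule(x)
```

On this simulated sample the rule flags the gross outlier first; whether later passes flag further points depends on the data and on k, which is exactly the arbitrariness complained about above.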

What to do with outliers is an open and very difficult question. Loosely, different solutions and strategies have varying appeal.

As a very broad-brush generalisation, there is a continuum of views on outliers in statistics and machine learning from extreme pessimists to extreme optimists. Extreme pessimists feel called to serve as if officers of a Statistical Inquisition, whose duty it is to find outliers as obnoxious contaminants in the data and to deal with them severely. This could be the position, say, of people dealing with financial transactions data, most of it honest or genuine, but some of it fraudulent or criminal. Extreme optimists know that outliers are likely, and usually genuine -- the Amazon, or Amazon, is real enough, and really big. Indeed, outliers are often interesting and important and instructive. Floods, fires, and financial crises are what they are, and some are very big.

Here is a partial list of possibilities. The ordering is arbitrary and not meant to convey any order in terms of applicability, importance or any other criterion. Nor are these approaches mutually exclusive.

  • One (in my view good) definition is that "[o]utliers are sample values that cause surprise in relation to the majority of the sample" (W.N. Venables and B.D. Ripley. 2002. Modern Applied Statistics with S. New York: Springer, p.119). However, surprise is in the mind of the beholder and is dependent on some tacit or explicit model of the data. There may be another model under which the outlier is not surprising at all, so the data really are (say) lognormal or gamma rather than normal. In short, be prepared to (re)consider your model.

  • Go into the laboratory or the field and do the measurement again. Often this is not practicable, but it would seem standard in several sciences.

  • Test whether outliers are genuine. Most of the tests look pretty contrived to me, but you might find one that you can believe fits your situation. Irrational faith that a test is appropriate is always needed to apply a test that is then presented as quintessentially rational.

  • Throw them out as a matter of judgement.

  • Throw them out using some more-or-less automated (usually not "objective") rule.

  • Ignore them, partially or completely. This could be formal (e.g. trimming) or just a matter of leaving them in the dataset, but omitting them from analyses as too hot to handle.

  • Pull them in using some kind of adjustment, e.g. Winsorizing (see the code sketch after this list).

  • Downplay them by using some other robust estimation method.

  • Downplay them by working on a transformed scale.

  • Downplay them by using a non-identity link function.

  • Accommodate them by fitting some appropriate fat-, long-, or heavy-tailed distribution, without or with predictors.

  • Accommodate by using an indicator or dummy variable as an extra predictor in a model.

  • Side-step the issue by using some non-parametric (e.g. rank-based) procedure.

  • Get a handle on the implied uncertainty using bootstrapping, jackknifing or a permutation-based procedure (also sketched after this list).

  • Edit to replace an outlier with some more likely value, based on deterministic logic. "An 18-year-old grandmother is unlikely, but the person in question was born in 1932, and it's now 2013, so presumably she is really 81."

  • Edit to replace an impossible or implausible outlier using some imputation method that currently passes for acceptable, not-quite-white magic.

  • Analyse with and without, and see how much difference the outlier(s) make(s), statistically, scientifically or practically.

  • Something Bayesian. My prior ignorance of quite what forbids me from giving any details.
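
Several of the "pull in or downplay" options above, namely Winsorizing, trimming, and a robust summary based on the median and MAD, can be contrasted in a few lines. This is a minimal NumPy-only sketch; the 5%/95% limits and the simulated data are arbitrary illustrative choices, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.append(rng.normal(loc=10.0, scale=2.0, size=50), 40.0)  # one gross outlier

# Winsorize: clip anything beyond the 5th/95th percentiles to those values.
lo, hi = np.percentile(x, [5, 95])
x_winsor = np.clip(x, lo, hi)

# Trim: drop the lowest and highest 5% of the sorted observations entirely.
x_sorted = np.sort(x)
cut = int(0.05 * len(x_sorted))
x_trim = x_sorted[cut:len(x_sorted) - cut]

# Robust location and scale: median and the scaled median absolute deviation
# (the 1.4826 factor makes the MAD comparable to the sd under normality).
med = np.median(x)
mad = 1.4826 * np.median(np.abs(x - med))

print("raw mean, sd:        ", x.mean(), x.std(ddof=1))
print("winsorized mean, sd: ", x_winsor.mean(), x_winsor.std(ddof=1))
print("trimmed mean, sd:    ", x_trim.mean(), x_trim.std(ddof=1))
print("median, scaled MAD:  ", med, mad)
```

The point of the comparison is not that any of these numbers is "right", but that each choice pulls in or ignores the extreme value in a different, explicit and reproducible way.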
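
The bootstrap suggestion above is similarly easy to sketch: resample the data with replacement and see how much the statistic of interest moves, outlier included. Again a minimal NumPy illustration; the choice of statistic (the mean) and the number of resamples are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.append(rng.normal(loc=10.0, scale=2.0, size=50), 40.0)  # one gross outlier

n_boot = 2000
boot_means = np.array([
    rng.choice(x, size=len(x), replace=True).mean()
    for _ in range(n_boot)
])

# Percentile interval for the mean; its width shows how much the single
# large value can swing the estimate, depending on how often it is resampled.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% percentile interval for the mean: ({lo:.2f}, {hi:.2f})")
```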

EDIT: This second edition benefits from other answers and comments. I've tried to flag my sources of inspiration.