I have a very general statistical question. If a variable has some extreme values, is it better, for the purposes of statistical inference (for example, OLS regression), to detect those extreme values and remove them from the data? And if so, what is the statistical justification?
Solved – Extreme values in the data
Related Solutions
These are not outliers. I have a paper on this and several working papers related to it. Do not trim or winsorize your data.
The distributions involved lack a mean, so there is no such thing as ANOVA or OLS regression for them. They also lack sufficient statistics. In practice, this limits you to only a few choices.
The first choice is to take the log of the data. Taking the log may or may not bias the estimates, depending on the theoretical construction you are testing; in some cases the bias is quite substantial, in others it is zero. There is also a theoretical issue: these distributions do not have a covariance matrix even in log form. There is a scale matrix, but it collapses to a single point, so you would need to be careful in interpreting the coefficients. While there isn't a covariance for the data, there is a variance.
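As a minimal sketch of the log approach (the strictly positive simulated data, the use of statsmodels, and the multiplicative model are my illustrative assumptions, not the answer's):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Heavy-tailed, strictly positive data (log-normal here for illustration).
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)
y = 2.0 * x ** 0.5 * rng.lognormal(mean=0.0, sigma=0.3, size=500)

# OLS on the raw data is dominated by the extreme observations.
# OLS on the logs is not, but the slope is now an elasticity rather than
# a level effect, which is the interpretation caveat raised above.
X_log = sm.add_constant(np.log(x))
fit = sm.OLS(np.log(y), X_log).fit()
print(fit.params)  # intercept ~ log(2) ~ 0.69, slope ~ 0.5
```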
The second choice is quantile regression or Theil's regression. The advantage of either is that you minimize an absolute (linear) loss. While the estimator is unbiased for the median, the median isn't $\mu$, so relative to the true center of location you have a bias; in fact, this guarantees the bias that logarithms may avoid. The advantage over the log transformation is that the ranks are sufficient statistics because the sample itself is sufficient. You will not lose information, as you would with the log transformation.
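A hedged sketch of both options, using statsmodels' QuantReg for median regression and scipy's theilslopes for Theil's (Theil–Sen) regression; the Cauchy errors are an illustrative assumption of a distribution with no mean:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=300)
# Cauchy errors: no mean, no variance, so OLS has no target to estimate.
y = 1.0 + 2.0 * x + rng.standard_cauchy(size=300)

# Median (quantile) regression: minimizes absolute loss at q = 0.5.
median_fit = sm.QuantReg(y, sm.add_constant(x)).fit(q=0.5)
print(median_fit.params)  # roughly [1.0, 2.0]

# Theil-Sen: median of pairwise slopes, a rank-based estimator.
slope, intercept, lo_slope, hi_slope = stats.theilslopes(y, x)
print(slope, intercept)   # roughly 2.0 and 1.0
```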
The third choice is some form of Bayesian solution. The problem is that we do not know your distribution. The advantage of a Bayesian solution is that the likelihood function is always minimally sufficient. This is slightly different from saying that all the information is in the statistic: all the information needed to calculate the likelihood is in the statistic, and there can be no greater reduction in dimension. All Bayesian solutions with a proper prior are biased, but the bias is usually far smaller than the Frequentist bias arising from the difference between the median and $\mu$.
Bayesian solutions also have the difficulty of being labor-intensive unless you are used to them. A Bayesian solution would "solve" all of your problems, but it carries a significant learning curve, and you need a reasonable prior density. The prior density contains your beliefs about the parameters before you actually saw the data; because your beliefs would differ from mine, the result is always subjective.
You would also need a good set of possible guesses as to the potential density function involved unless you could derive the density from first principles.
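The answer does not prescribe a particular model, so the following is only one illustrative sketch: a regression with a heavy-tailed Student-t likelihood and proper priors, written in PyMC. The library choice, the priors, and the simulated data are all assumptions of mine:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.standard_t(df=2, size=200)

with pm.Model():
    # Proper priors: beliefs about the parameters before seeing the data.
    alpha = pm.Normal("alpha", mu=0.0, sigma=10.0)
    beta = pm.Normal("beta", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    nu = pm.Exponential("nu", lam=0.1)  # tail weight is estimated too

    # Heavy-tailed likelihood: extreme values are expected, not deleted.
    pm.StudentT("y_obs", nu=nu, mu=alpha + beta * x, sigma=sigma, observed=y)

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=2)

print(idata.posterior["beta"].mean())  # roughly 2.0
```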
As for Frequentist tests, you should pick up a good book on non-parametric and distribution-free statistics. Kruskal–Wallis one-way analysis of variance may accomplish what you are looking for from ANOVA, but it may not. This is a case where you really want to pick up a textbook. There are a ton of them on Amazon. My books are rather old and there have been advances since, so I suggest an academic library as your first stop. Do not go down the path of Bayesian nonparametric statistics unless you are an advanced user of statistics. Frequentist non-parametric and distribution-free methods are very simple, and simplicity is its own virtue here. If you happen to find a book on Amazon on Bayesian non-parametric statistics, don't buy it; the learning curve will be nearly vertical.
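A minimal sketch of the Kruskal–Wallis test with scipy; the three Cauchy groups are an illustrative assumption, and the point is that the test only sees ranks, so extreme values cannot dominate it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Three heavy-tailed groups sharing a shape but differing in location.
a = stats.cauchy.rvs(loc=0.0, size=50, random_state=rng)
b = stats.cauchy.rvs(loc=1.0, size=50, random_state=rng)
c = stats.cauchy.rvs(loc=2.0, size=50, random_state=rng)

# Rank-based test of whether the groups come from the same distribution.
stat, p = stats.kruskal(a, b, c)
print(stat, p)
```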
There are books explicitly on non-parametric econometrics. You should look at those.
Best Answer
A key distinction: mismeasurement or extreme events?
Are the extreme values due to extreme events or to error? You generally want to include the former but exclude the latter. You don't want your results driven by error; more generally, you don't want them driven by bizarre behavior that is unrelated to what you're trying to model.
Some examples: a stock price recorded at one hundred times its true value because of a misplaced decimal is mismeasurement and should be fixed or dropped, while a genuine market crash is an extreme event and usually belongs in the sample.
A key distinction: left-hand-side or right-hand-side variables?
Dropping observations conditional on the value of a left-hand-side variable tends to be problematic. It can easily qualify as research misconduct, like trying to estimate the effect of schooling while dropping all the low test scores under some dubious argument that they somehow don't count.
Depending on context, transforming right-hand-side variables can be OK. There is often more flexibility in what you use to predict or explain the data.
Some techniques that can be valid (depending on context):

For example, in accounting data you often have a few companies with bizarre, extreme numbers, and you want to give ordinary least squares regression a reasonable shot at fitting something other than those few outliers. To reduce the effect of the outliers, you can, for instance:

- winsorize the variable (cap values beyond, say, the 1st and 99th percentiles at those cutoffs),
- trim it (drop the most extreme observations), or
- transform it (for example, take logs or ranks).

A minimal sketch of winsorizing follows this list.
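This uses scipy.stats.mstats.winsorize; the 1%/99% cutoffs and the simulated return-on-assets series are illustrative assumptions:

```python
import numpy as np
from scipy.stats import mstats

rng = np.random.default_rng(4)
# Mostly ordinary values plus a few extreme ones, as in accounting data.
roa = np.concatenate([rng.normal(0.05, 0.02, 995),
                      [5.0, -3.0, 8.0, 4.0, -6.0]])

# Cap the bottom and top 1% at the 1st and 99th percentiles.
roa_w = mstats.winsorize(roa, limits=(0.01, 0.01))

print(roa.max(), roa_w.max())  # extreme values are pulled in, not dropped
```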
Robust methods:
There are other types of regression that may be more robust to extreme outliers; one example is sketched below.
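As one hedged example, a Huber M-estimator, which down-weights observations with large residuals. The answer names no specific method, so statsmodels' RLM and the simulated outliers here are my assumptions:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)
y[:10] += 50.0  # a handful of wild outliers

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
huber_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print(ols_fit.params)    # pulled toward the outliers
print(huber_fit.params)  # close to [1.0, 2.0]
```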
There are a lot of different approaches people use to deal with outliers, and what's reasonable often depends on context.