Solved – Extreme values in the data

Tags: extreme value, winsorizing

I have a very general statistical question. If a variable has some extreme values, is it better, for the purposes of statistical inference (for example, OLS regression), to detect these extreme values and remove them from the data? And if so, what is the statistical justification?

Best Answer

A key distinction: mismeasurement or extreme events?

Are the extreme values due to extreme events or to error? You generally want to include the former but exclude the latter: you don't want your results driven by error, and more generally you don't want them driven by bizarre behavior that's unrelated to what you're trying to model.

Some examples:

  • In finance, excluding extreme events like bankruptcy would be a horrible mistake. It is often the extreme observations (e.g., deaths, -100% returns, crashes) that you really care about!
  • On the other hand, financial data aren't perfect. You can find cases where the decimal point is in the wrong place: 100.00 mistakenly recorded as 10,000, and so on.
  • There's often fuzzy stuff in between...

A key distinction: left-hand-side or right-hand-side variables?

Dropping observations conditional on the value of a left-hand-side variable tends to be problematic. It can easily qualify as research misconduct, like estimating the effects of schooling and dropping all the low test scores under some dubious argument that they somehow don't count.

Depending on context, transforming right-hand-side variables can be OK. There's often more flexibility in what you use to try to predict or explain the data.

Some techniques that can be valid (depending on context):

For example, in accounting data you often have a few companies with bizarre, extreme numbers, and you want to give ordinary least squares regression a reasonable shot at fitting something other than those few outliers. To reduce the effect of outliers, you can:

  1. Trim the data (e.g., drop the most extreme 1 percent of observations). This is most reasonable when the outliers are almost certainly wrong (e.g., a recorded human height of -2 feet or 135 feet). You can still go seriously wrong by trimming the data, though.
  2. Arguably better is to winsorize the data: e.g., replace values above the 99th percentile with the value of the 99th percentile.
  3. Use a more elaborate outlier-detection procedure such as ellipsoidal peeling: find the minimum-volume ellipsoid that encloses the data, then drop the points on its boundary (and possibly repeat). A minimal sketch of all three ideas follows this list.
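
Here is a minimal Python sketch of these three ideas on synthetic data. The percentile cutoffs, the contamination rate, and the use of scikit-learn's EllipticEnvelope (a robust-covariance outlier detector in the same spirit as ellipsoidal peeling, not peeling itself) are illustrative assumptions, not a prescription:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
x = rng.normal(loc=100.0, scale=15.0, size=1_000)
x[:3] = [10_000.0, -250.0, 9_500.0]  # simulated data-entry errors

# 1. Trimming: drop observations beyond the 1st/99th percentiles.
lo, hi = np.percentile(x, [1, 99])
x_trimmed = x[(x >= lo) & (x <= hi)]

# 2. Winsorizing: clip extreme values to the percentile cutoffs
#    instead of dropping them (here on both tails).
x_winsorized = np.clip(x, lo, hi)

# 3. Ellipse-based detection on multivariate data: fit a robust
#    covariance ellipsoid and flag points that fall far outside it.
X = np.column_stack([x, rng.normal(size=x.size)])
detector = EllipticEnvelope(contamination=0.01, random_state=0).fit(X)
X_clean = X[detector.predict(X) == 1]  # predict: +1 inlier, -1 outlier
```

SciPy's scipy.stats.mstats.winsorize does the same percentile clipping directly, if you'd rather not roll your own.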

Robust methods:

There are other types of regression that may be more robust to extreme outliers.

  1. Quantile regression (e.g., fit the conditional median).
  2. Instead of minimizing the sum of squared residuals, use the Huber loss function or something else that penalizes large residuals less heavily (see the sketch below).
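
As a hedged illustration, here is how both estimators look in statsmodels on synthetic data with a few gross outliers; the simulated coefficients and the default HuberT tuning are assumptions for the example, not recommendations:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(size=200)
y[:5] += 50.0  # a few gross outliers in the response

X = sm.add_constant(x)

# Quantile regression at q=0.5 fits the conditional median,
# which is far less sensitive to extreme y values than the mean.
median_fit = sm.QuantReg(y, X).fit(q=0.5)

# Huber M-estimation: quadratic loss for small residuals, linear
# for large ones, so outliers get bounded influence.
huber_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

# OLS for comparison; its estimates get pulled by the outliers.
ols_fit = sm.OLS(y, X).fit()

for name, res in [("OLS", ols_fit), ("median", median_fit), ("Huber", huber_fit)]:
    print(name, res.params)
```

On data like this you should see the robust fits stay near the true (2, 3) while the OLS intercept drifts upward because of the contaminated responses.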

There are a lot of different approaches people use to deal with outliers, and what's reasonable often depends on context.
