Solved – How to identify and remove outliers in R

outliers, r, regression

I am performing regression analysis on the prices of products that we have purchased, based on size and other attributes.

However, there are often buys made in odd circumstances that factor into the price, and these circumstances are not (and cannot be) addressed directly in the features of the analysis.

Each time I run a regression, I manually check the 20 observations with the largest errors. More than 90% of the time they are odd buys like those mentioned above, and for my purposes they can be completely ignored.

I have been looking into Cook's distance to remove these, but I'm not sure how best to set the threshold, or whether there is a better method to use.

Best Answer

"However, there are often buys made in odd circumstances that factor into the price, and these circumstances are not (and cannot be) addressed directly in the features of the analysis."

  • Isn't that what error terms in a regression are supposed to capture: variation in the outcome variable that isn't explained by the features of your model?

If your question is how to deal with outliers in general, under the assumption that extreme observations are probably bad data, some standard approaches are:

  • Trimming the data. E.g., ignore the 1% most extreme observations.
  • Winsorizing the data. Replace observations above or below some cutoff with the value of the cutoff. (This isn't quite as extreme as trimming the data, which deletes extreme observations entirely.)
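
As a rough illustration of the two approaches above, here is a minimal base R sketch. The data are simulated and purely hypothetical (the names size, price, and odd are not from the question), and two choices are assumptions rather than anything the approaches require: trimming is done on the absolute residuals of an initial OLS fit, and winsorizing is applied to the raw price at its 1st and 99th percentiles.

```r
# Hypothetical data: 400 purchases, 4 of them "odd buys" with inflated prices.
set.seed(1)
n     <- 400
size  <- runif(n, 1, 10)
price <- 5 + 3 * size + rnorm(n)
odd   <- sample(n, 4)
price[odd] <- price[odd] + 40
dat <- data.frame(size = size, price = price)

# Baseline OLS fit on all observations.
fit_ols <- lm(price ~ size, data = dat)

# Trimming: drop the 1% of observations with the largest absolute residuals.
abs_res <- abs(residuals(fit_ols))
keep    <- abs_res <= quantile(abs_res, 0.99)
fit_trimmed <- lm(price ~ size, data = dat[keep, ])

# Winsorizing: clamp values outside the given quantiles to the quantile values.
winsorize <- function(x, probs = c(0.01, 0.99)) {
  q <- quantile(x, probs, na.rm = TRUE)
  pmin(pmax(x, q[[1]]), q[[2]])
}
dat$price_w <- winsorize(dat$price)
fit_wins <- lm(price_w ~ size, data = dat)
```

Comparing coef(fit_ols), coef(fit_trimmed), and coef(fit_wins) shows how much each treatment moves the estimates. The 1% cutoff is arbitrary here; in practice it should reflect how many bad observations you actually expect.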

Some fancier approaches to outliers (ignore if this is at all confusing):

  • You can do things like ellipsoidal peeling: find the minimum-volume ellipsoid that encloses your data, then remove the observations along its surface.
  • Estimate the regression with a Huber loss function, or something else less sensitive to outliers than OLS. Another option is a maximum likelihood estimator with t-distributed rather than normally distributed errors.
  • Quantile regression.
  • You could adopt some Bayesian view as to whether an observation is bad data.
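
Of the approaches above, the Huber-loss regression is probably the quickest to try in R. The sketch below assumes the MASS package is installed (it ships with most R distributions as a recommended package) and again uses simulated, hypothetical data. rlm() fits an M-estimator by iteratively reweighted least squares, and psi.huber selects the Huber loss.

```r
library(MASS)  # assumption: MASS is available; it provides rlm() and psi.huber

# Hypothetical data: 400 purchases, 4 of them "odd buys" with inflated prices.
set.seed(1)
n     <- 400
size  <- runif(n, 1, 10)
price <- 5 + 3 * size + rnorm(n)
odd   <- sample(n, 4)
price[odd] <- price[odd] + 40
dat <- data.frame(size = size, price = price)

# Ordinary least squares, for comparison.
fit_ols <- lm(price ~ size, data = dat)

# Robust regression with the Huber loss: large residuals are down-weighted
# rather than deleted, so no explicit outlier threshold is needed.
fit_huber <- rlm(price ~ size, data = dat, psi = psi.huber)

# Final IRLS weights: values well below 1 mark observations the fit
# treated as outlying, which makes them candidates for manual review.
suspect <- which(fit_huber$w < 0.5)
```

Inspecting dat[suspect, ] gives a shortlist similar to checking the largest errors by hand, while the coefficients in coef(fit_huber) are already estimated with those points down-weighted. The 0.5 weight cutoff is only an illustrative choice.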

Beware the problems of mishandling outliers...

In many cases, such as returns for financial securities, removing or ignoring outliers can be hugely problematic. Often, all the action is in the outliers! Major stock market crashes, company bankruptcies, and similar events are hugely important.

For situations involving safety (e.g., car crashes), ignoring bad outliers can be even worse! You don't want to winsorize such that observations where people die get replaced with observations where people are mildly injured. That would perhaps be criminal negligence.
