Solved – How to identify and remove outliers in R

outliers, r, regression

I am performing regression analysis on the prices of products that we have purchased, based on size and other attributes.

However, there are often buys made in odd circumstances that factor into the price, and these circumstances are not (and cannot be) addressed directly in the features of the analysis.

Each time I run a regression, I manually check the 20 observations with the largest errors. More than 90% of the time they are odd buys like those mentioned above, and for my purposes they can be completely ignored.

I have been looking into Cook's distance to remove these, but I'm not sure how best to set the threshold, or whether there is a better method to use.

Best Answer

"However, there are often buys made in odd circumstances that factor into the price, and these circumstances are not (and cannot be) addressed directly in the features of the analysis."

Isn't that what error terms in a regression are supposed to capture: variation in the outcome variable that isn't explained by the features of your model?

If your question is how to deal with outliers in general, under the assumption that extreme observations are probably bad data, some standard approaches are:

Trimming the data. E.g., ignore the 1% most extreme observations.
Winsorizing the data. Replace observations above or below some cutoff with the value of the cutoff. (This isn't quite as extreme as trimming, which deletes extreme observations entirely.)
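For what it's worth, here is a minimal sketch of both approaches in base R. The simulated data, the variable names (size, price), the helper winsorize() and the 1% / 99% cutoffs are all assumptions made for illustration, not a recommendation for your data.

```r
set.seed(1)
n <- 500
size  <- runif(n, 1, 10)
price <- 5 + 2 * size + rnorm(n)

# Contaminate a few observations to mimic "odd buys" (simulated, hypothetical)
odd <- sample(n, 10)
price[odd] <- price[odd] + 30
dat <- data.frame(size = size, price = price)

fit <- lm(price ~ size, data = dat)
res <- resid(fit)

# Trimming: drop the 1% of observations with the largest absolute residuals
cutoff <- quantile(abs(res), 0.99)
keep <- abs(res) <= cutoff
fit_trimmed <- lm(price ~ size, data = dat[keep, ])

# Winsorizing: clamp values to the chosen lower and upper percentiles
winsorize <- function(x, probs = c(0.01, 0.99)) {
  lim <- unname(quantile(x, probs, na.rm = TRUE))
  pmin(pmax(x, lim[1]), lim[2])
}
dat$price_w <- winsorize(dat$price)
fit_wins <- lm(price_w ~ size, data = dat)
```

Trimming on residuals needs an initial fit, which is itself affected by the outliers, so it is worth re-checking the dropped rows by hand, as you already do.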

Some fancier approaches to outliers (ignore if this is at all confusing):

Ellipsoidal peeling. Find the minimum-volume ellipsoid that encloses your data, then remove the observations on its surface.
Estimate the regression with a Huber loss function, or something else less sensitive to outliers than OLS. Or use a maximum likelihood estimator with t-distributed rather than normally distributed errors.
Quantile regression.
You could adopt some Bayesian view as to whether an observation is bad data.
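As a rough sketch of the Huber-loss option only, the MASS package (a recommended package shipped with standard R installations) provides rlm(), which fits by iteratively reweighted least squares and stores the final weights in the w component. The simulated data and variable names below are assumptions for illustration.

```r
library(MASS)  # provides rlm() and psi.huber

set.seed(2)
n <- 200
size  <- runif(n, 1, 10)
price <- 5 + 2 * size + rnorm(n)

# Simulated "odd buys": 10 observations with a large price shift
odd <- sample(n, 10)
price[odd] <- price[odd] + 40
dat <- data.frame(size = size, price = price)

fit_ols   <- lm(price ~ size, data = dat)
fit_huber <- rlm(price ~ size, data = dat, psi = psi.huber)

# Observations that the Huber fit down-weights the most are natural
# candidates for manual review, instead of deleting them outright
low_weight <- order(fit_huber$w)[1:10]
dat[low_weight, ]
```

The appeal of this route is that no observation is removed and no hard threshold has to be chosen up front; extreme observations simply count for less.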

Beware the problems of mishandling outliers...

In many cases, such as returns for financial securities, removing or ignoring outliers can be hugely problematic. Often, all the action is in the outliers! Major stock market crashes, company bankruptcies, and the like are hugely important.

For situations involving safety (e.g. car crashes), ignoring bad outliers can be even worse! You don't want to winsorize so that observations where people die get replaced with observations where people are mildly injured. That would perhaps be criminal negligence.

Related Solutions

Solved – How to design and implement an asymmetric loss function for regression

As mentioned in the comments above, quantile regression uses an asymmetric loss function (linear, but with different slopes for positive and negative errors). The quadratic (squared-loss) analog of quantile regression is expectile regression.

You can google quantile regression for the references. For expectile regression see the R package expectreg and the references in the reference manual.
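To make the two losses concrete, here is a small base-R sketch. The function names (pinball_loss, expectile_loss, fit_constant) are made up for this illustration, and the "model" is just a single constant fitted with optimize(), not a full regression.

```r
# Pinball (quantile) loss: linear, with slope tau for positive errors
# and slope (1 - tau) for negative errors
pinball_loss <- function(err, tau) {
  ifelse(err >= 0, tau * err, (tau - 1) * err)
}

# Expectile loss: the same asymmetric weights applied to squared errors
expectile_loss <- function(err, tau) {
  ifelse(err >= 0, tau, 1 - tau) * err^2
}

# Fit a single constant m by minimizing the total loss of (y - m)
fit_constant <- function(y, loss, tau) {
  optimize(function(m) sum(loss(y - m, tau)), interval = range(y))$minimum
}

y <- c(1, 2, 3, 4, 100)
q50 <- fit_constant(y, pinball_loss, 0.5)    # should land near median(y)
e50 <- fit_constant(y, expectile_loss, 0.5)  # should land near mean(y)
e90 <- fit_constant(y, expectile_loss, 0.9)  # pulled toward the large value
```

With tau = 0.5 both losses are symmetric, which is why they recover the median and the mean respectively; moving tau away from 0.5 penalizes over- and under-prediction differently.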

Solved – Regression analysis in R using text field

First, I'd split each text description into words. There are several ways to do it; the simplest is to use strsplit with a suitable split argument.

What you get is a list of character vectors, one per description, each containing that description's words. Note: if you choose a bad split argument you'll end up with lots of garbage. That might not be too bad, since you can filter out some of the garbage later.

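# split is a regular expression; a vector such as c(" ", ",") would be
# recycled along the descriptions instead of splitting on either character
all.words = strsplit(descriptions, "[ ,]+")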

Now I'd combine these into a single vector of words and count how often each one occurs:

words = unlist(all.words)
word.count = table(words)

Now I'd choose only words that appear several times (in my example, more than 3 times):

chosen.words = names(word.count)[word.count>3]

Now, for each word and for each case in your data, I'd add an indicator variable telling whether the given word appears in the description of the given item.

With this new data, you have a new variable for each word. You can add these variables to your regression, and each coefficient will tell you the relative contribution of that word to price.
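A minimal sketch of the indicator step, end to end. The descriptions and prices below are made-up toy data, and the threshold is lowered to "at least 3 times" only because the toy data set is tiny.

```r
# Hypothetical toy data: item descriptions and prices
descriptions <- c("large red box", "small red box", "large blue crate",
                  "red box", "large red crate", "small green box",
                  "large box")
price <- c(12, 7, 15, 6, 16, 7, 11)

all.words <- strsplit(descriptions, "[ ,]+")
word.count <- table(unlist(all.words))
chosen.words <- names(word.count)[word.count >= 3]  # threshold is an assumption

# One 0/1 column per chosen word: does the word appear in the description?
indicators <- sapply(chosen.words, function(w) {
  as.integer(sapply(all.words, function(ws) w %in% ws))
})

dat <- data.frame(price = price, indicators)
fit <- lm(price ~ ., data = dat)
coef(fit)
```

If two chosen words always occur together (or never together, like "large" and "small" in a data set where every item is one or the other), the indicators are collinear and lm will report NA for one of them, so it is worth checking the coefficients for that.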

HTH