I am performing regression analysis on the prices of products that we have purchased, based on size and other attributes.
However, there are often buys made in odd circumstances that factor into the price, and these circumstances are not (and cannot be) captured directly by the features in the analysis.
Each time I run a regression, I manually check the 20 observations with the largest errors, and 90%+ of the time they are odd buys like those mentioned above, which for my purposes can be completely ignored.
I have been looking into Cook's distance to remove these; however, I'm not sure how best to set the threshold, or whether there is a better method to use.
Best Answer
If your question is how to deal with outliers in general, under the assumption that extreme observations are probably bad data, some standard approaches are:
Some fancier approaches to outliers (ignore if this is at all confusing):
Beware the problems of mishandling outliers...
In many cases, such as returns on financial securities, removing or ignoring outliers can be hugely problematic. Often, all the action is in the outliers! Major stock market crashes, company bankruptcies, etc. are hugely important.
For situations involving safety (e.g. auto crashes), ignoring bad outliers can be even worse! You don't want to winsorize observations such that observations where people die get replaced with observations where people are mildly injured. That would perhaps be criminal negligence.
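To make the winsorizing concern concrete, here is a minimal sketch using NumPy. The `winsorize` helper, the 5th/95th percentile cutoffs, and the severity scores are all hypothetical, chosen only for illustration; they are not taken from any real dataset.

```python
import numpy as np


def winsorize(values, lower_pct=5.0, upper_pct=95.0):
    """Clip values to the given lower and upper percentiles.

    Hypothetical helper: the percentile cutoffs are illustrative defaults,
    not a recommendation.
    """
    arr = np.asarray(values, dtype=float)
    lo, hi = np.percentile(arr, [lower_pct, upper_pct])
    # Values below lo are raised to lo; values above hi are lowered to hi.
    return np.clip(arr, lo, hi)


# Made-up severity scores; 100.0 stands in for the one catastrophic event.
severity = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 100.0])
clipped = winsorize(severity)

# The catastrophic observation is pulled toward the bulk of the data,
# so both the maximum and the mean shrink.
print("max before/after: ", severity.max(), clipped.max())
print("mean before/after:", severity.mean(), clipped.mean())
```

The observations in the middle of the distribution are untouched; only the extremes are overwritten. That is harmless when the extremes are bad data, and exactly the wrong thing to do when they are the events you care about.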