Is it reasonable to delete a large number of outliers from a dataset?

outliers

I need some advice on what is a reasonable number of cases to be deleted as outliers.

I have applied outlier detection methods to identify univariate and multivariate outliers in my dataset. Altogether, 30% of the data were classified as outliers.

If I delete all of these outliers, my results appear to improve. Also, after deleting the outliers my sample size is still good (i.e., n=300).

  • Is it reasonable to delete all the cases classified as outliers?

Best Answer

I would be more than suspicious if someone told me that 30% of my sample were outliers ...

Rather than blindly trusting a canned routine, I would carefully analyze the data and try to find out why an outlier is an outlier. Is it a "bug" or a "feature"? Is it measurement error? Does your sample cover different sub-populations (a mixture)?

Moreover, the detection of outliers involves the more or less arbitrary definition of a threshold that separates "good" observations from "bad" ones. You should assess whether these thresholds are sensible. It could thus be a good idea to move the goalposts, that is, to vary the thresholds and see how the set of flagged cases and your results change.
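To get a feel for how much the threshold matters, here is a minimal sketch in Python (standard library only). It is not the routine you used: the `flagged_fraction` helper, the median/MAD-based robust z-score, the cutoffs, and the simulated log-normal sample are all assumptions chosen for illustration. The sample contains no contamination at all; it is merely skewed.

```python
import random
import statistics


def flagged_fraction(values, cutoff):
    """Share of values whose robust z-score (median/MAD based) exceeds cutoff."""
    med = statistics.median(values)
    abs_dev = [abs(v - med) for v in values]
    # 1.4826 rescales the MAD so that it estimates the standard deviation
    # when the data are normally distributed.
    scale = 1.4826 * statistics.median(abs_dev)
    if scale == 0:
        return 0.0  # more than half of the values are identical; nothing is flagged
    return sum(d / scale > cutoff for d in abs_dev) / len(values)


random.seed(0)
# Hypothetical data: a skewed (log-normal) sample without any "bad" observations.
sample = [random.lognormvariate(0, 1) for _ in range(1000)]

for cutoff in (2.0, 2.5, 3.0, 3.5):
    share = flagged_fraction(sample, cutoff)
    print(f"cutoff={cutoff}: {share:.1%} flagged")
```

If the share of flagged cases swings widely as the cutoff moves, or stays large for a sample you know to be clean, that suggests the rule is reacting to the shape of the distribution rather than to genuine errors.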

Also note that, rather than dropping observations, you could use robust statistical techniques (for example the median, trimmed means, or robust regression) if you are concerned about outliers.
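As a small illustration of the robust route, the sketch below compares the ordinary mean with two robust location estimates. The data are made up, and `trimmed_mean` is a hypothetical helper written for this example, not a library function; it assumes a trimming proportion below 0.5.

```python
import statistics


def trimmed_mean(values, proportion=0.1):
    """Mean after discarding the lowest and highest `proportion` of values."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)


# Hypothetical measurements containing one gross error (250.0).
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 250.0]

print("mean:        ", statistics.fmean(data))
print("median:      ", statistics.median(data))
print("trimmed mean:", trimmed_mean(data, 0.1))
```

The single bad value drags the mean far away from the bulk of the data, while the median and the trimmed mean barely move, and no observation had to be declared an outlier and deleted by hand.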