Solved – Is it reasonable to delete a large number of outliers from a dataset

outliers

I need some advice on what is a reasonable number of cases to be deleted as outliers.

I have applied outlier detection methods to identify univariate and multivariate outliers in my dataset. Altogether, 30% of the data were classified as outliers.

If I delete all of these outliers, my results appear to improve. Also, after deleting the outliers my sample size is still good (i.e., n=300).

Is it reasonable to delete all the cases classified as outliers?

Best Answer

I would be more than suspicious if someone told me that 30% of my sample were outliers ...

Rather than blindly trusting a canned routine, I would carefully analyze the data and try to find out why an outlier is an outlier. Is it a "bug" or a "feature"? Is it measurement error? Does your sample cover different sub-populations (a mixture)?

Moreover, detecting outliers involves a more or less arbitrary threshold that separates "good" from "bad" observations. You should assess whether these thresholds are sensible. It can therefore be a good idea to move the goalposts, i.e. vary the thresholds, and see how the set of flagged cases and your results change.
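To make the threshold point concrete, here is a minimal sketch of such a sensitivity check. It assumes a simple univariate IQR fence rule, which is not necessarily the method used in the question; the function name `iqr_outlier_fraction` and the data are made up for illustration.

```python
import statistics

def iqr_outlier_fraction(values, k):
    """Fraction of values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lo or v > hi]
    return len(flagged) / len(values)

# Made-up illustrative data, not from the question.
sample = [9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7, 10.2,
          12.5, 7.1, 10.0, 9.9, 15.0, 10.1, 10.2, 9.6]

# Vary the fence multiplier k and watch how the flagged share changes.
fractions = {k: iqr_outlier_fraction(sample, k) for k in (1.0, 1.5, 2.0, 3.0)}
for k, frac in fractions.items():
    print(f"k={k}: {frac:.0%} of cases flagged")
```

If the share of flagged cases swings widely with small changes in `k`, the 30% figure probably says more about the chosen cutoff than about the data.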

Also note that rather than dropping observations, you could use robust statistical techniques if you are concerned about outliers.
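As a rough illustration of what "robust" means here, the sketch below compares the mean and standard deviation with the median and the median absolute deviation (MAD) on made-up data containing one extreme value. The helper `mad` and the numbers are assumptions for illustration only; for regression or SEM you would look at robust estimators suited to that model.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

# Made-up data: a clean sample and the same sample plus one extreme value.
clean = [10, 11, 9, 10, 12, 10, 11]
contaminated = clean + [100]

# How far each location estimate moves when the extreme value is added.
mean_shift = abs(statistics.mean(contaminated) - statistics.mean(clean))
median_shift = abs(statistics.median(contaminated) - statistics.median(clean))

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
print(f"stdev with extreme value: {statistics.stdev(contaminated):.2f}")
print(f"MAD with extreme value:   {mad(contaminated):.2f}")
```

The median and MAD barely react to the extreme value, so no observation has to be dropped to get a stable summary.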
