I need some advice on what counts as a reasonable number of cases to delete as outliers.
I have applied outlier detection methods to identify univariate and multivariate outliers in my dataset. Altogether, 30% of the data were classified as outliers.
If I delete all of these outliers, my results appear to improve. Also, after deleting the outliers my sample size is still good (i.e., n=300).
- Is it reasonable to delete all the cases classified as outliers?
Best Answer
I would be more than suspicious if someone told me that 30% of my sample were outliers ...
Rather than blindly trusting a canned routine, I would carefully analyze the data and try to find out why an outlier is an outlier. Is it a "bug" or a "feature"? Is it measurement error? Does your sample cover different sub-populations (a mixture)?
Moreover, the detection of outliers involves the more or less arbitrary definition of a threshold that separates "good" from "bad" observations. You should assess whether these thresholds are sensible. It could thus be a good idea to move the goalposts, i.e., vary the thresholds, and see how the number of flagged cases and your results change.
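To see how much the share of "outliers" depends on the chosen threshold, here is a minimal sketch. It assumes a simple univariate IQR rule (Tukey fences) with a multiplier `k`, which may not be the method you actually used; the function name `flagged_fraction` and the simulated skewed sample are made up for illustration only.

```python
import random
import statistics

def flagged_fraction(values, k):
    """Share of values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return len(flagged) / len(values)

# Simulated right-skewed data (an assumption, not the asker's dataset).
random.seed(0)
sample = [random.lognormvariate(0.0, 1.0) for _ in range(1000)]

# Move the goalposts: the same data, judged against several thresholds.
fractions = {k: flagged_fraction(sample, k) for k in (1.0, 1.5, 2.0, 3.0)}
for k, frac in fractions.items():
    print(f"k={k}: {frac:.1%} flagged")
```

If the flagged share swings widely as `k` changes, that suggests the "outliers" are largely a product of the threshold (or of a skewed distribution) rather than genuinely aberrant cases.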
Also note that rather than dropping observations, you could use robust statistical techniques if you are concerned about outliers.
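As a small illustration of what "robust" means here, the sketch below compares the mean and standard deviation with the median and the median absolute deviation (MAD) on a toy sample with two extreme values added. The numbers and the helper `mad` are invented for illustration; robust regression and similar methods follow the same idea but are not shown.

```python
import statistics

def mad(values):
    """Median absolute deviation around the median (unscaled)."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

# Toy data (an assumption): a tight cluster, then the same cluster plus two extremes.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
contaminated = clean + [55.0, 60.0]

for name, data in (("clean", clean), ("contaminated", contaminated)):
    print(
        f"{name}: mean={statistics.mean(data):.2f}, "
        f"stdev={statistics.stdev(data):.2f}, "
        f"median={statistics.median(data):.2f}, "
        f"MAD={mad(data):.2f}"
    )
```

The mean and standard deviation shift substantially once the extreme values are included, while the median and MAD barely move, so no observations need to be dropped to get a stable summary.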