Solved – Should I remove any outliers before splitting the data?

data preprocessing, outliers

I've split my data into three sets (training, validation, and testing) before doing any pre-processing. I thought that any pre-processing tasks had to take place after splitting the data. However, some online posts seem to say that outlying values should be removed (if they are to be removed at all) before the data is split.

Should I run my outlier removal on the entire data set and then split it, run it only on the training + validation data, or run it separately on each of the three partitions?
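For concreteness, here is a sketch of the last option as I understand it: split first, fit the outlier criterion on the training partition only, then apply those same training-derived thresholds to each partition. The z-score rule, the 60/20/20 split, and the simulated data are just illustrative assumptions, not my actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=5.0, size=1000)

# Split first: 60% train / 20% validation / 20% test.
shuffled = rng.permutation(data)
train, val, test = np.split(shuffled, [600, 800])

# Fit the outlier criterion on the training partition only...
mu, sigma = train.mean(), train.std()

# ...then apply the same (training-derived) thresholds everywhere.
def remove_outliers(x, k=3.0):
    return x[np.abs(x - mu) <= k * sigma]

train_clean, val_clean, test_clean = map(remove_outliers, (train, val, test))
```

The point of fitting `mu` and `sigma` on `train` alone is that no statistic of the validation or test data leaks into the cleaning rule.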

Best Answer

The answer is "it depends". You haven't told us the nature of your data, the nature of those outliers, or how you identify them as outliers. In some cases, those so-called outliers are not outliers at all; a better model would attribute them to some cause. An example: Alaskan North Slope climate change just outran one of our tools to measure it. In that case, automated outlier detection flagged the climate data for Utqiaġvik, Alaska as missing "for all of 2017 and the last few months of 2016".

In other cases, there is no model other than that the datum in question is bad (bad recording, bad transmission, ...), in which case editing it out may well be the best thing to do. Regardless of how robust a technique is, I've yet to see one that is robust against 60-sigma outliers. Given any reasonable distribution, you'll never, ever see a true 60-sigma outlier. Yet they do happen all the time: a high-order bit can flip from a zero to a one during noisy transmission, or a manually recorded piece of data can have a misplaced decimal or be expressed in the wrong units.
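To make that concrete, here is a toy example (simulated sensor readings, not real data) where a single misplaced decimal point turns an ordinary reading into an outlier hundreds of standard deviations out:

```python
import numpy as np

rng = np.random.default_rng(1)
readings = rng.normal(loc=100.0, scale=1.0, size=10_000)

# A misplaced decimal on one manually recorded value: e.g. 99.7 -> 997.0
corrupted = readings.copy()
corrupted[0] *= 10.0

# How many sigmas out is the bad value, relative to the clean distribution?
z = abs(corrupted[0] - readings.mean()) / readings.std()
# z lands in the hundreds: no Gaussian process produces such a value,
# but a data-entry slip does it effortlessly.
```

No robust estimator is needed to justify deleting a point like this; no plausible physical model produces it.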