Solved – When should you remove Outliers – Entire Dataset or Train Dataset

Tags: data preprocessing, machine learning, outliers

I have been trying to understand the concepts of data leakage and outlier analysis, as I am new to data analysis and machine learning. I have googled these topics and understand data leakage, but it is not clear to me when outlier analysis should be performed.

To build an accurate and correct model, my understanding is:

  • Split the dataset into train/test as the first step, before any data cleaning and processing (e.g. handling null values, feature transformation, feature scaling). This is because the test data is used to simulate how the model would perform if it were deployed in a real-world scenario. Therefore you cannot clean/process the entire dataset.

  • Outlier detection (in general terms) should be done on the train dataset. This again simulates a real-world scenario, as the model pipeline will need to determine whether there are any outliers and then take the appropriate action (e.g. remove, impute, cap to a certain threshold). Checking for outliers on the entire dataset (and acting on them) results in data leakage. See the sketch after this list.
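
A minimal sketch of the workflow I have in mind, assuming pandas/scikit-learn and a simple IQR-based capping rule (the data, column name and thresholds are made up purely for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: one numeric feature and a binary target.
df = pd.DataFrame({"feature": np.random.randn(200),
                   "target": np.random.randint(0, 2, 200)})

# 1) Split first, before any cleaning/processing.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
train_df, test_df = train_df.copy(), test_df.copy()

# 2) Learn the outlier rule (here: IQR-based caps) on the training data only.
q1, q3 = train_df["feature"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# 3) Apply the same training-derived caps to both splits.
train_df["feature"] = train_df["feature"].clip(lower, upper)
test_df["feature"] = test_df["feature"].clip(lower, upper)
```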

My question is: should outlier detection/analysis be done on the training dataset, or on the entire dataset before it is split into train/test?

I am trying to understand what the most common practice is.
I understand that outlier detection is not as straightforward as described above, as other factors may need to be considered.

Note: When searching CrossValidated, there are lots of answers regarding data leakage from the train/test split, but there is no clear answer on when to remove outliers.

Best Answer

In my opinion you cannot remain vague about "outliers" when asking such questions. The answer will most likely depend on what you mean by an outlier and on the procedure used to deal with outliers. A few imaginary scenarios:

  1. You have photographs of animals and some of them are damaged by technical errors. In this case you would simply discard them from the entire dataset, as they would equally be discarded in, as you put it, a real-world scenario.

  2. You have gene expression data and some genes have abnormally high expression levels. You decide to deal with this by capping the expression at some arbitrary threshold $c$. Since this is a within-sample procedure - meaning the results will be the same regardless of whether you process each sample one by one or all of them together - you can again perform this before splitting into training and testing.

  3. You have similar gene expression data as before, with some abnormally high values, but you decide to use cross-validation to find an optimal threshold parameter $c$. Now you would actually have to perform this outlier "normalization" step not only separately for training and testing data, but also separately within each cross-validation fold (see the sketch after this list).

  4. You have customer data from an insurance company where samples can have missing features. You decide to impute those features using average values from the samples of the same class. Here you would have to perform this correction after splitting into training and testing. And again - if you do cross-validation - separately in each cross-validation fold.
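
To make the difference between scenarios 2 and 3 concrete, here is a minimal sketch assuming scikit-learn; the `QuantileCapper` transformer, the quantile grid and the classifier are made up for illustration. Because the cap is estimated in `fit()`, wrapping it in a `Pipeline` guarantees it is re-fitted on the training portion of every cross-validation fold:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

class QuantileCapper(BaseEstimator, TransformerMixin):
    """Cap each feature at its q-th quantile, estimated from the data seen in fit()."""
    def __init__(self, q=0.99):
        self.q = q

    def fit(self, X, y=None):
        # The cap is "learned" here, so inside a Pipeline it is re-estimated
        # on the training part of every cross-validation fold (scenario 3).
        self.caps_ = np.quantile(X, self.q, axis=0)
        return self

    def transform(self, X):
        return np.minimum(X, self.caps_)

# Scenario 3: the threshold is a tuned hyperparameter, so the capping must be
# fitted separately within each fold; GridSearchCV + Pipeline handles this.
pipe = Pipeline([("cap", QuantileCapper()), ("clf", LogisticRegression())])
search = GridSearchCV(pipe, param_grid={"cap__q": [0.95, 0.99, 0.999]}, cv=5)
# search.fit(X_train, y_train)  # X_train / y_train from an earlier split
```

In contrast, the fixed cap of scenario 2 (e.g. `np.minimum(X, c)` for a known constant $c$) involves no estimation, so it gives identical results whether applied before or after the split. Scenario 4 follows the same logic as scenario 3: the class-wise means used for imputation are estimated quantities, so they must be computed on the training data (or training fold) only and then reused on the test data.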

In summary, your general observation about checking whether a procedure would transfer to the "real world" setting is on point. Alternatively, you can get intuition by pondering whether a certain procedure can be performed on a single sample (such procedures are called "in-sample" or "within-sample" procedures). As an example, you cannot subtract a feature-wise mean estimated from a single sample: the mean of one sample is the sample itself, so you would get all 0s.

When dealing with an "out-sample" (between-sample) procedure you have to make sure that any estimation (a.k.a. "learning") is always done using only the training data. Then, once you have obtained the values from the training data, you apply those same values to the testing data. And yes - simple things like centering the data by subtracting a feature-wise mean are also "learning". So you estimate the mean in the training step and subtract this training-data-obtained mean in the testing stage.
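
A minimal sketch of this fit-on-train / apply-on-test pattern using NumPy (the arrays are placeholders; scikit-learn's StandardScaler follows the same fit/transform logic):

```python
import numpy as np

# Placeholder train/test matrices of shape (n_samples, n_features).
X_train = np.random.randn(100, 5)
X_test = np.random.randn(20, 5)

# "Learning" step: the feature-wise mean is estimated on the training data only.
train_mean = X_train.mean(axis=0)

# Application step: the same training-derived mean is reused on the test data.
X_train_centered = X_train - train_mean
X_test_centered = X_test - train_mean   # not X_test.mean(axis=0)
```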