Solved – When should you remove Outliers – Entire Dataset or Train Dataset

Tags: data preprocessing, machine learning, outliers

I have been trying to understand the concepts of data leakage and outlier analysis, as I am new to data analysis and machine learning. I have googled these topics and understand data leakage, but it is not clear to me when outlier analysis should be performed.

To build an accurate and correct model, my understanding is:

  • Split the dataset into train/test as the first step, before any data cleaning and processing (e.g. handling null values, feature transformation, feature scaling). This is because the test data is used to simulate how the model would perform if it were deployed in a real-world scenario. Therefore you cannot clean/process the entire dataset.

  • Outlier detection (in general terms) should be done on the train dataset. This again simulates a real-world scenario, as the model pipeline will need to determine whether there are any outliers and then take the appropriate action (e.g. remove, impute, cap to a certain threshold). Checking for outliers on the entire dataset (and acting on them) results in data leakage. See the sketch after this list.
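
A minimal sketch of the workflow I have in mind, assuming pandas/scikit-learn and a simple IQR-based capping rule (the data, column name and thresholds are made up purely for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: one numeric feature and a binary target.
df = pd.DataFrame({"feature": np.random.randn(200),
                   "target": np.random.randint(0, 2, 200)})

# 1) Split first, before any cleaning/processing.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
train_df, test_df = train_df.copy(), test_df.copy()

# 2) Learn the outlier rule (here: IQR-based caps) on the training data only.
q1, q3 = train_df["feature"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# 3) Apply the same training-derived caps to both splits.
train_df["feature"] = train_df["feature"].clip(lower, upper)
test_df["feature"] = test_df["feature"].clip(lower, upper)
```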

My question is: should outlier detection/analysis be done on the training dataset, or on the entire dataset before it is split into train/test?

I am trying to understand what the most common practice is.
I understand that outlier detection is not as straightforward as described above, as other factors may need to be considered.

Note: When searching CrossValidated, there are lots of answers regarding data leakage from the train/test split, but there is no clear answer on when to remove outliers.

Best Answer

In my opinion you cannot remain vague about "outliers" when asking such questions. The answer will most likely depend on what you mean by an outlier and on the procedure used to deal with outliers. A few imaginary scenarios:

  1. You have photographs of animals and some of them are damaged by technical errors. In this case you would simply discard them from the entire dataset, as they would equally be discarded in, as you put it, a real-world scenario.

  2. You have gene expression data and some genes have abnormally high expression levels. You decide to deal with this by capping the expression at some arbitrary threshold $c$. Since this is a within-sample procedure - meaning the results will be the same regardless of whether you process each sample one by one or all of them together - you can again perform this before splitting into training and testing.

  3. You have similar gene expression data as before, with some abnormally high values, but you decide to use cross-validation to find an optimal threshold parameter $c$. Now you would actually have to perform this outlier "normalization" step not only separately for training and testing data, but also separately within each cross-validation fold (see the sketch after this list).

  4. You have customer data from an insurance company where samples can have missing features. You decide to impute those features using average values from the samples of the same class. Here you would have to perform this correction after splitting into training and testing. And again - if you do cross-validation - separately in each cross-validation fold.
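
To make the difference between scenarios 2 and 3 concrete, here is a minimal sketch assuming scikit-learn; the `QuantileCapper` transformer, the quantile grid and the classifier are made up for illustration. Because the cap is estimated in `fit()`, wrapping it in a `Pipeline` guarantees it is re-fitted on the training portion of every cross-validation fold:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

class QuantileCapper(BaseEstimator, TransformerMixin):
    """Cap each feature at its q-th quantile, estimated from the data seen in fit()."""
    def __init__(self, q=0.99):
        self.q = q

    def fit(self, X, y=None):
        # The cap is "learned" here, so inside a Pipeline it is re-estimated
        # on the training part of every cross-validation fold (scenario 3).
        self.caps_ = np.quantile(X, self.q, axis=0)
        return self

    def transform(self, X):
        return np.minimum(X, self.caps_)

# Scenario 3: the threshold is a tuned hyperparameter, so the capping must be
# fitted separately within each fold; GridSearchCV + Pipeline handles this.
pipe = Pipeline([("cap", QuantileCapper()), ("clf", LogisticRegression())])
search = GridSearchCV(pipe, param_grid={"cap__q": [0.95, 0.99, 0.999]}, cv=5)
# search.fit(X_train, y_train)  # X_train / y_train from an earlier split
```

In contrast, the fixed cap of scenario 2 (e.g. `np.minimum(X, c)` for a known constant $c$) involves no estimation, so it gives identical results whether applied before or after the split. Scenario 4 follows the same logic as scenario 3: the class-wise means used for imputation are estimated quantities, so they must be computed on the training data (or training fold) only and then reused on the test data.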

In summary, your general observation about checking whether a procedure would transfer to the "real world" setting is on point. Alternatively, you can get intuition by pondering whether a certain procedure can be performed on a single sample (such procedures are called "in-sample" or "within-sample" procedures). As an example, you cannot subtract a feature-wise mean estimated from a single sample: the mean of one sample is the sample itself, so you would get all 0s.

When dealing with an "out-sample" (between-sample) procedure you have to make sure that any estimation (a.k.a. "learning") is always done using only the training data. Then, once you have obtained the values from the training data, you apply those same values to the testing data. And yes - simple things like centering the data by subtracting a feature-wise mean are also "learning". So you estimate the mean in the training step and subtract this training-data-obtained mean in the testing stage.
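
A minimal sketch of this fit-on-train / apply-on-test pattern using NumPy (the arrays are placeholders; scikit-learn's StandardScaler follows the same fit/transform logic):

```python
import numpy as np

# Placeholder train/test matrices of shape (n_samples, n_features).
X_train = np.random.randn(100, 5)
X_test = np.random.randn(20, 5)

# "Learning" step: the feature-wise mean is estimated on the training data only.
train_mean = X_train.mean(axis=0)

# Application step: the same training-derived mean is reused on the test data.
X_train_centered = X_train - train_mean
X_test_centered = X_test - train_mean   # not X_test.mean(axis=0)
```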