Outliers – How to Remove Sample Outliers Using Standard Deviation


I am looking to use clinical and other measurements to predict a blood metabolite with Elastic-Net regression models.

Can I remove samples with values greater than 1.96 SD from the mean as outliers? I read a post stating that if the samples are normal (e.g., disease-free), this could be fine.

The samples are collected from 10 different sites within the United States and then processed in the lab to test a specific cellular function, which yields a number indicating the efficiency of the process. To harmonize the data, I thought removing the outliers was important, but I am still not sure this is a good approach. I noticed that a non-linear model's performance improved significantly when I removed samples more than 2 SD from the mean, and when I used only values within 1 SD of the mean (which really cuts down the number of samples), the mean squared error was reduced as well.
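For clarity, the filter I am describing looks something like this (the values here are placeholders, not my real data):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=50, scale=10, size=6000)  # placeholder values for the lab measurement

# Drop samples more than 1.96 SD from the mean
mu, sd = y.mean(), y.std()
keep = np.abs(y - mu) <= 1.96 * sd
y_filtered = y[keep]
print(f"removed {np.sum(~keep)} of {y.size} samples")
```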

The sample size is approximately 6000.

The goal is to fit both non-linear regression models and linear repeated Elastic-Net models.

Best Answer

The use of a standard deviation-based threshold for "outlier" detection is generally not a good idea. I think there may be some confusion about what "normal data" means: it refers to the statistical distribution of the data, not to a qualitative description of the population from which they derive. Your post suggests that being from a disease-free population was taken to mean the data were "normal" when, in fact, the normality discussed in the post you linked is that of the normal distribution in statistics.

Perhaps the link was not the correct one, as I did not see anywhere in it a recommendation to omit cases as outliers when they fall beyond some number of standard deviations from the mean. This criterion doesn't make sense for outlier detection because the normal distribution itself guarantees that some values will be extreme. In other words, by definition, we expect about 5% of a normally distributed sample to fall outside 1.96 standard deviations of the mean. That does not make those values outliers; it just makes them rarer "extremes" in the distribution. And this is before considering the issue raised by @whuber: the presence of true outliers inflates the standard deviation itself, moving the threshold.
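A quick simulation makes both points concrete (simulated data only; nothing here depends on your measurements):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)  # perfectly "clean" normal data, no outliers

# ~5% of a normal sample falls outside 1.96 SD of the mean by definition
frac = np.mean(np.abs(x - x.mean()) > 1.96 * x.std())
print(f"flagged by the 1.96 SD rule: {frac:.1%}")  # ~5.0%

# @whuber's point: one genuine outlier inflates the SD itself,
# widening the threshold and potentially masking other extremes
x_bad = np.append(x, 1000.0)
print(f"SD before: {x.std():.2f}, after one outlier: {x_bad.std():.2f}")
```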

Now, to the issue of your noted change in model performance when omitting the "outliers." The general gist of linear regression models is to predict a conditional mean (with some caveats and simplification, obviously). When extreme cases are omitted, we are left with cases whose values all sit relatively close to that mean, with reduced variation. You mention that the MSE improves when omitting cases beyond a certain number of standard deviations, which is almost guaranteed because you are selectively omitting the cases that would have large deviations from the mean. Thinking about the equation for MSE, $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$: residuals are squared (making them positive), so very large residuals become even larger, and very large residuals are most likely for cases whose raw values are far from the mean to begin with. The MSE is therefore a biased indicator of model performance here, and I'd recommend looking at things like predictive distribution plots to see whether the model actually makes realistic predictions of the data, rather than just how large the residuals are on average.
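You can see this mechanical drop in MSE with a single fitted model and simulated data: score it on all cases and then only on the cases near the mean outcome, and the second score is smaller by construction, not because the model improved.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.normal(size=(6000, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=3.0, size=6000)

# One model, two evaluation sets: everything vs. only the cases
# within 2 SD of the mean outcome
model = LinearRegression().fit(X, y)
pred = model.predict(X)
keep = np.abs(y - y.mean()) < 2 * y.std()
print("MSE, all cases:    ", mean_squared_error(y, pred))
print("MSE, trimmed cases:", mean_squared_error(y[keep], pred[keep]))
```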

To the question of outliers, consider identifying cases that are influential on the model, along with formal outlier detection methods. There are many univariate and multivariate outlier tests, but the overall identification of outliers is sometimes questionable; it may be better to think of outliers as arising from a distinct data-generating process rather than as merely providing irrelevant information to the model. When outliers represent clearly incorrect data (e.g., a data entry error, an experimenter issue, an out-of-range value), removing those observations is more justifiable. It sounds like you may be concerned specifically with differences between your sites. If that's the case, you might move to multilevel models in which site is a grouping variable that can have random intercepts and slopes. This gets back, ultimately, to choosing a model that reflects your beliefs about the process generating the data you've observed.
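As a minimal sketch of what that could look like in Python with statsmodels (the file and column names here are hypothetical stand-ins for your outcome, your lab measurement, and your site variable):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical column names: 'metabolite' (outcome), 'efficiency'
# (the lab measurement), 'site' (collection site, 10 levels)
df = pd.read_csv("samples.csv")  # hypothetical file

# Random intercept per site; adding re_formula="~efficiency" would
# also let the slope of the measurement vary across sites
model = smf.mixedlm("metabolite ~ efficiency", data=df, groups=df["site"])
result = model.fit()
print(result.summary())
```

With site modeled explicitly, between-site shifts are absorbed by the random effects rather than showing up as apparent "outliers" that need to be trimmed away.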