Solved – When to remove outliers

Tags: cooks-distance, linear, outliers, r, regression

I am currently fitting a linear regression model. At the suggestion of my professor, we have looked at Cook’s distance to identify outliers. Here is the Cook’s distance plot from R. From what I understand, it shows that points 6 and 24 are influential.

But how should that affect our analysis? Does this mean we should eliminate these points? According to our dataset's background, the data is reliable. I read somewhere else that unless you have a specific reason to remove an outlier, you should always keep it. Is this true?

[Figure: Cook’s distance plot produced in R]

Best Answer

Outliers are not always a bad thing.

  • Sometimes they reflect the stochastic nature of the data (e.g. data in finance tend to have heavy tails, and it is common to observe "outliers").

  • In other instances, they may be explained by covariates.

For example,

set.seed(1)
x <- c(21, 22, 23, 24, 25, 50)
# Simulate y from a linear model with standard normal errors
y <- 5 + 2 * x + rnorm(length(x))
y
# [1]  46.37355  49.18364  50.16437  54.59528  55.32951 104.17953

One could think that the largest observation is an outlier, but it is clearly explained by the covariate $x$, and the errors are normal by construction.
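
As a rough sketch of how one might check this in R (the names `fit` and `diagnostics` are illustrative and not part of the original example), fit the line and look at the residuals alongside leverage and Cook's distance. Because its $x$ value is far from the rest, the sixth point has leverage close to 1, so an influence measure may well flag it even though it was generated by exactly the same model as the other points.

```r
set.seed(1)
x <- c(21, 22, 23, 24, 25, 50)
y <- 5 + 2 * x + rnorm(length(x))

# Ordinary least squares fit of y on x
fit <- lm(y ~ x)

# Per-observation diagnostics from base R's stats package
diagnostics <- data.frame(
  x        = x,
  residual = residuals(fit),
  leverage = hatvalues(fit),
  cooks_d  = cooks.distance(fit)
)
print(round(diagnostics, 3))
```

Reading the columns together is the point: a large Cook's distance can come from high leverage alone, which is a property of where $x$ sits, not evidence that the observation is wrong.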

  • In other cases the presence of outliers might be related to data quality (e.g. a typo).

  • There are other possible reasons as well.

Thus, in general, it is better to reflect on the potential reasons for having outliers than to apply outlier detection methods automatically and blindly.
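
One way to put this into practice, sketched here under stated assumptions, is to compare the fit with and without the flagged points before deciding anything. The built-in `cars` dataset stands in for the asker's data, which is not available, and the 4/n cutoff is only one common rule of thumb, not a recommendation from this answer.

```r
# Stand-in data: the built-in `cars` dataset, not the asker's data
fit_all <- lm(dist ~ speed, data = cars)

d <- cooks.distance(fit_all)
# 4/n is an illustrative rule-of-thumb cutoff (an assumption)
flagged <- which(d > 4 / nrow(cars))

# Refit on the remaining rows; setdiff() stays correct even if nothing is flagged
keep <- setdiff(seq_len(nrow(cars)), flagged)
fit_sub <- lm(dist ~ speed, data = cars[keep, ])

# Side-by-side coefficients show how much the flagged points matter
print(cbind(all = coef(fit_all), without_flagged = coef(fit_sub)))
```

If the conclusions barely change, keeping the points costs nothing. If they change a lot, that is a reason to investigate those observations, not by itself a reason to delete them.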

A nice quote from Andrew Gelman:

Stepwise regression is one of these things, like outlier detection and pie charts, which appear to be popular among non-statisticians but are considered by statisticians to be a bit of a joke.

Reference for the quote: https://statmodeling.stat.columbia.edu/2014/06/02/hate-stepwise-regression/
