Solved – When to remove outliers

cooks-distance, linear, outliers, r, regression

I am currently fitting a linear regression model. At my professor’s suggestion, we looked at Cook’s distance to identify outliers. Here is the Cook’s distance plot from R. From what I understand, it shows that points 6 and 24 are influential.

But how should that affect our analysis? Does this mean we should eliminate these points? According to our dataset’s background, the data are reliable. I read elsewhere that unless you have a specific reason to remove an outlier, you should always keep it. Is this true?

Best Answer

Outliers are not always a bad thing.

Sometimes they reflect the stochastic nature of the data (e.g. financial data tend to have heavy tails, so it is common to observe "outliers").
In other instances, they may be explained by covariates.

For example,

set.seed(1)
x <- c(21, 22, 23, 24, 25, 50)
# True line 5 + 2*x plus standard normal noise
y <- 5 + 2*x + rnorm(length(x))
y
# [1]  46.37355  49.18364  50.16437  54.59528  55.32951 104.17953

One might think that the largest observation is an outlier, but it is clearly explained by the covariate $x$, and the errors are normal by construction.
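To make this concrete, here is a minimal sketch that reuses the simulated data from this answer and contrasts a rule that looks at y alone with diagnostics from the fitted regression. The object name `fit` is my own choice, and the 1.5 x IQR boxplot rule is only one of many possible univariate rules, so treat this as an illustration rather than a recommended workflow.

```r
set.seed(1)
x <- c(21, 22, 23, 24, 25, 50)
y <- 5 + 2 * x + rnorm(length(x))

# Looking at y alone, the boxplot rule (1.5 x IQR) flags the largest value
boxplot.stats(y)$out

# Once x is taken into account, the simple linear model describes all points
fit <- lm(y ~ x)
coef(fit)    # estimates of the intercept and slope used to simulate y
resid(fit)   # raw residuals, one per observation

# The sixth point is far from the other x values, so it has high leverage
hatvalues(fit)

# Cook's distance combines residual size and leverage, so it can be large
# for a high-leverage point even when the model describes that point well
cooks.distance(fit)
```

A point flagged by Cook's distance in a setting like this is influential because of where it sits in $x$, which is not in itself evidence that the observation is wrong.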

In other cases, the presence of outliers might be related to data quality (e.g. a typo), among other possible reasons.
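As a rough illustration of the data-quality case, the sketch below corrupts one value of the simulated data with a hypothetical misplaced decimal point. The names `y_typo`, `fit_typo` and `fit_fixed` and the choice of the third observation are assumptions made only for this example.

```r
set.seed(1)
x <- c(21, 22, 23, 24, 25, 50)
y <- 5 + 2 * x + rnorm(length(x))

# Hypothetical data-entry error: the decimal point of the third value is shifted
y_typo <- y
y_typo[3] <- y[3] * 10

fit_typo  <- lm(y_typo ~ x)   # fit on the corrupted data
fit_fixed <- lm(y ~ x)        # fit after correcting the entry

# The corrupted observation has the largest absolute residual
which.max(abs(resid(fit_typo)))

# Compare the coefficients with the typo and with the corrected value
rbind(typo = coef(fit_typo), fixed = coef(fit_fixed))
```

Here the sensible remedy is to trace the value back to its source and correct it, which is a different decision from deleting a point just because a diagnostic flagged it.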

Thus, in general, it is better to reflect on the potential reasons for having outliers than to apply outlier-detection methods automatically and blindly.

A nice quote from Andrew Gelman:

Stepwise regression is one of these things, like outlier detection and pie charts, which appear to be popular among non-statisticians but are considered by statisticians to be a bit of a joke.

Reference for the quote: https://statmodeling.stat.columbia.edu/2014/06/02/hate-stepwise-regression/
