I am currently fitting a linear regression model. At my professor's suggestion, we looked at Cook's distance to identify outliers. Here is the Cook's distance plot produced in R. From what I understand, it shows that points 6 and 24 are influential.
But how should that affect our analysis? Does this mean we should eliminate these points? According to our dataset's background, the data is reliable. I read elsewhere that unless you have a specific reason to remove an outlier, you should always keep it. Is this true?
Best Answer
Outliers are not always a bad thing.
Sometimes they reflect the stochastic nature of the data (e.g. data in finance tend to have heavy tails, and it is common to observe "outliers"); in other instances, they may be explained by covariates.
For example, one could think that the largest observation in a sample is an outlier, yet it may be clearly explained by the covariate $x$ while the residual errors are perfectly normal.
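A minimal sketch of this situation, using only Python's standard library. The data are simulated and entirely hypothetical (the sample size, the coefficients 1 and 2, and the isolated covariate value at $x = 10$ are assumptions chosen for illustration). The point is only that an observation can look extreme among the $y$ values and still have an unremarkable residual once $x$ is accounted for.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical data: 30 covariate values near 0, plus one far away at x = 10.
x = [random.gauss(0, 1) for _ in range(30)] + [10.0]
# Every response follows y = 1 + 2x + Normal(0, 1) noise, including the last one.
y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]

# Ordinary least squares with a single covariate.
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
resid_sd = statistics.stdev(residuals)

# Marginal view: how extreme is the largest y among all y values?
i_max = max(range(len(y)), key=lambda i: y[i])
z_marginal = (y[i_max] - y_bar) / statistics.stdev(y)
# Conditional view: how extreme is its residual after accounting for x?
z_residual = residuals[i_max] / resid_sd

print(f"largest y: z-score among y = {z_marginal:.2f}, "
      f"scaled residual = {z_residual:.2f}")
```

Looking at $y$ alone, the last observation stands far from the rest, but its residual is small because the model with $x$ already accounts for it.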
In other cases, the presence of outliers might be related to data quality (e.g. a typo), among other possible reasons.
Thus, in general, it is better to reflect on the potential reasons for the outliers than to apply outlier-detection methods automatically and blindly.
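One way to support that reflection is a sensitivity check: compute Cook's distance, then compare the fit with and without the most influential point, and only afterwards ask why the point behaves as it does. The sketch below is in plain Python for a one-covariate model and uses the standard formula $D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_i}{(1-h_i)^2}$. The data and the helper names `fit_line` and `cooks_distance` are hypothetical, and this is not meant as a rule for deleting observations.

```python
import statistics

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope

def cooks_distance(x, y):
    """Cook's distance of each point for a one-covariate linear model."""
    n, p = len(x), 2  # p = number of coefficients (intercept and slope)
    a, b = fit_line(x, y)
    x_bar = statistics.fmean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance estimate
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]  # leverages h_i
    return [e * e / (p * s2) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]

# Hypothetical data: the last point has an unusual x and departs from the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 15.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.0]

d = cooks_distance(x, y)
flagged = max(range(len(d)), key=lambda i: d[i])  # index of the largest distance

# Refit without the flagged point to see how much the estimates move.
full_fit = fit_line(x, y)
reduced_fit = fit_line(x[:flagged] + x[flagged + 1:], y[:flagged] + y[flagged + 1:])

print(f"flagged index: {flagged}, Cook's distance: {d[flagged]:.2f}")
print(f"slope with all points: {full_fit[1]:.2f}, without it: {reduced_fit[1]:.2f}")
```

If the conclusions barely change, the point can simply stay. If they change a lot, that is a reason to investigate the observation and to report both fits, not by itself a reason to drop it.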
A nice quote from Andrew Gelman:
Reference for the quote: https://statmodeling.stat.columbia.edu/2014/06/02/hate-stepwise-regression/