Solved – Regression analysis vs outlier detection

Tags: outliers, regression

I have seen many examples where, even if the goal is to find outliers, the first step is to run a linear (or other) regression and only then move on to outlier identification. Why?

If one only wishes to find outliers, can't regression analysis be skipped by using methods like the 68–95–99.7 rule?
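For concreteness, applying the 68–95–99.7 rule directly to a single variable could look like this sketch (plain z-scores with NumPy; the ±3σ cutoff and the planted value are illustrative assumptions, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), [8.0]])  # plant one obvious outlier

# 68-95-99.7 rule: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
outliers = np.flatnonzero(np.abs(z) > 3)
```

This flags the planted point at index 100, but as the answer below explains, such a univariate check ignores any relationship between variables.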

Best Answer

Checking for outliers in a single series by using 1, 2, or 3 standard deviations as thresholds is often of limited interest. Usually, this type of investigation arises in a setting with an explicit dependent variable Y and several independent variables X. In other words, your objective is to estimate and explain the behavior of Y by a set of associated or causal X variables.

Given the above, what does investigating outliers entail? Well, there are two types of outliers. The first type consists of observations associated with large errors or residuals; these are measured by studentized residuals. The second type consists of X outliers, observations whose X values deviate markedly from the average of the Xs; these are measured by leverage (the diagonal of the hat matrix, hence "hat values").

So, what do you do with these studentized residuals and leverage measures? They are combined into a third and final measure of outlierness: Cook's D. The higher this value, the more influence a specific data point has on the regression coefficients of your model. To fully understand these concepts, learn how to interpret and graph an influence plot in R.

An influence plot in R will readily identify the data points that have the most influence on your regression coefficients. Next, rerun your regression removing those influential data points one at a time (starting with the highest Cook's D value), and observe how much your regression coefficients move. If a couple of data points pull in the same direction, you may consider removing both from your regression and again observing how much the coefficients move.

If you are interested in testing your regression for outliers further, there is an entire family of robust regressions designed to do exactly that (Huber M-estimation, MM-estimation, Quantile Regression). So, you could rerun your model using one or more of these robust regression methods.

Ideally, when you rerun your regression after removing the most influential outliers, or when you use a robust regression, your coefficients remain relatively stable compared with the original fit. Similarly, the statistical significance of your independent X variables should remain reasonably stable. If they are not stable, you may have an explicit justification or explanation for that instability, and that is fine. If you do not have an acceptable explanation, consider removing the variables with unstable coefficients and weakened statistical significance.

Outlier testing is one of the most important aspects of testing a model.

Notice that in this entire discussion I did not speak of outliers in Y itself. By themselves, they are not always interesting. On the other hand, the difference between Y and its fitted value (the residual) is extremely interesting. But, as indicated, residual outliers are only one part of outlier investigation and testing.
