Solved – Outlier detection using regression

outliers, regression

Can regression be used for outlier detection? I understand that there are ways to improve a regression model by removing the outliers. But the primary aim here is not to fit a regression model but to find outliers using regression.

Best Answer

If you want to use regression to find outliers, your best option is robust regression.

Ordinary regression can be impacted by outliers in two ways:

First, an extreme outlier in the y-direction at x-values near $\bar x$ can affect the fit in that area in the same way an outlier can affect a mean.

Second, an 'outlying' observation in x-space is an influential observation - it can pull the fit of the line toward it. If it's sufficiently far away the line will go through the influential point:

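[Figure: two scatter plots of the same data with one influential point in x-space. In the right-hand plot the point is further out; the least squares line is shown in red and a robust line in green.]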

In the left plot, there's a point that's quite influential, and it pulls the line quite a way from the large bulk of the data. In the right plot, it's been moved even further away -- and now the line goes through the point. When the x-value is that extreme, as you move that point up and down, the line moves with it, going through the mean of the other points and through the one influential point.

An influential point that's perfectly consistent with the rest of the data may not be such a big problem, but one that's far from a line through the rest of the data will make the line fit it, rather than the data.

If you look at the right-hand plot, the red line - the least squares regression line - doesn't show the extreme point as an outlier at all - its residual is 0. Instead, the large residuals from the least squares line are in the main part of the data!

This means you can completely miss an outlier.
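To make the masking effect concrete, here's a minimal sketch with made-up data (the sample size, the true line y = 1 + 2x, and the location of the extreme point are all assumptions for illustration, not the data from the plots). It fits ordinary least squares and compares the residual at the extreme point with the residuals in the bulk of the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bulk of the data: 30 points scattered around y = 1 + 2x (made-up values).
x_bulk = rng.uniform(0, 10, size=30)
y_bulk = 1.0 + 2.0 * x_bulk + rng.normal(0, 1, size=30)

# One point far out in x-space and far from the line through the bulk.
x = np.append(x_bulk, 200.0)
y = np.append(y_bulk, 0.0)

# Ordinary least squares fit of y = a + b*x.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

ols_resid_outlier = abs(resid[-1])               # residual at the extreme point
ols_max_resid_bulk = np.abs(resid[:-1]).max()    # largest residual in the bulk

print("fitted slope:", coef[1])
print("residual at extreme point:", ols_resid_outlier)
print("largest residual in bulk:", ols_max_resid_bulk)
```

With this setup the fitted slope ends up nowhere near 2, and the extreme point's residual is smaller than the largest residual among the ordinary points, so a residual-based check on the least squares fit would point at the wrong observations.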

Even worse, with multiple regression, an outlier in x-space may not look particularly unusual on any single x-variable. If such a point is a possibility, relying on least squares regression is potentially very risky.
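As a rough illustration of that multiple-regression case, the sketch below builds two strongly correlated predictors and adds one point that goes against the correlation. The data, the correlation of 0.95, and the use of hat-matrix diagonals (leverages) as the measure of "outlying in x-space" are my assumptions for the example; other measures such as a robust Mahalanobis distance would do a similar job.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two strongly correlated predictors (made-up data, correlation about 0.95).
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)

# Add one point that is moderate on each axis but goes against the correlation.
x1 = np.append(x1, 1.5)
x2 = np.append(x2, -1.5)

# Marginal z-scores: how unusual the point looks one variable at a time.
z1 = (x1 - x1.mean()) / x1.std()
z2 = (x2 - x2.mean()) / x2.std()

# Leverage = diagonal of the hat matrix X (X'X)^{-1} X', computed via QR.
X = np.column_stack([np.ones(n + 1), x1, x2])
Q, _ = np.linalg.qr(X)
leverage = np.sum(Q**2, axis=1)

print("z-scores of added point:", z1[-1], z2[-1])
print("leverage of added point:", leverage[-1])
print("largest leverage among the other points:", leverage[:-1].max())
```

The added point sits within two standard deviations on each variable separately, yet it has the highest leverage in the data set, which is exactly the kind of point that univariate checks on the predictors won't catch.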

Robust regression

If you fit a robust line, in particular one that is robust to influential outliers (like the green line in the second plot), then the outlier has a very large residual.

In that case, you have some hope of identifying outliers: they'll be the points that aren't, in some sense, close to the line.
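Here's a minimal sketch of that idea on made-up data. The Theil-Sen estimator (median of pairwise slopes) is my choice for illustration, since it tolerates a sizeable fraction of contaminated points, including ones far out in x; it isn't necessarily the method behind the green line in the plot. The helper `theil_sen`, the MAD-based scale, and the cutoff of 3 are likewise assumptions, one reasonable reading of "not close to the line".

```python
import numpy as np

def theil_sen(x, y):
    """Robust line: slope is the median of all pairwise slopes,
    intercept is the median of y - slope * x."""
    i, j = np.triu_indices(len(x), k=1)
    dx = x[j] - x[i]
    keep = dx != 0                      # skip pairs with identical x
    slope = np.median((y[j] - y[i])[keep] / dx[keep])
    intercept = np.median(y - slope * x)
    return intercept, slope

rng = np.random.default_rng(0)

# Bulk of the data around y = 1 + 2x, plus one extreme point (made-up values).
x_bulk = rng.uniform(0, 10, size=30)
y_bulk = 1.0 + 2.0 * x_bulk + rng.normal(0, 1, size=30)
x = np.append(x_bulk, 200.0)
y = np.append(y_bulk, 0.0)

intercept, slope = theil_sen(x, y)
resid = y - (intercept + slope * x)

# Robust residual scale: median absolute deviation, rescaled so that it
# estimates the standard deviation under normal errors.
scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))

# Flag points whose residual is large relative to the robust scale.
flagged = np.flatnonzero(np.abs(resid) > 3 * scale)

print("robust slope:", slope)
print("residual at extreme point:", resid[-1])
print("flagged indices:", flagged)
```

Because the robust line follows the bulk of the data, the extreme point (index 30) ends up with a residual in the hundreds and is flagged. A cutoff like this can occasionally flag an ordinary point as well, so treat the flags as candidates to look at rather than a verdict.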


Removing outliers

You certainly can use a robust regression to identify and thereby remove outliers.

But once you have a robust regression fit, one that is already not badly affected by outliers, you don't necessarily need to remove the outliers -- you already have a model that's a good fit.