Robust Linear Regression – Fast Methods Resistant to Outliers

fused-lassolinear modeloutliersregressionrobust

I am dealing with linear data with outliers, some of which are at more the 5 standard deviations away from the estimated regression line. I'm looking for a linear regression technique that reduces the influence of these points.

So far what I did is to estimate the regression line with all the data, then discard the data point with very large squared residuals (say the top 10%) and repeated the regression without those points.

In the literature there are lots of possible approaches: least trimmed squares, quantile regression , m-estimators, etc. I really don't know which approach I should try, so I'm looking for suggestions. The important for me is that the chosen method should be fast because the robust regression will be computed at
each step of an optimization routine. Thanks a lot!

Best Answer

If your data contains a single outlier, then it can be found reliably using the approach you suggest (without the iterations though). A formal approach to this is

Cook, R. Dennis (1979). Influential Observations in Linear Regression. Journal of the American Statistical Association (American Statistical Association) 74 (365): 169–174.

For finding more than one outlier, for many years, the leading method was the so-called $M$-estimation family of approach. This is a rather broad family of estimators that includes Huber's $M$ estimator of regression, Koenker's L1 regression as well as the approach proposed by Procastinator in his comment to your question. The $M$ estimators with convex $\rho$ functions have the advantage that they have about the same numerical complexity as a regular regression estimation. The big disadvantage is that they can only reliably find the outliers if:

  • the contamination rate of your sample is smaller than $\frac{1}{1+p}$ where $p$ is the number of design variables,
  • or if the outliers are not outlying in the design space (Ellis and Morgenthaler (1992)).

You can find good implementation of $M$ ($l_1$) estimates of regression in the robustbase (quantreg) R package.

If your data contains more than $\lfloor\frac{n}{p+1}\rfloor$ outlier potentially also outlying on the design space, then, finding them amounts to solving a combinatorial problem (equivalently the solution to an $M$ estimator with re-decending/non-convex $\rho$ function).

In the last 20 years (and specially last 10) a large body of fast and reliable outlier detection algorithms have been designed to approximately solve this combinatorial problem. These are now widely implemented in the most popular statistical packages (R, Matlab, SAS, STATA,...).

Nonetheless, the numerical complexity of finding outliers with these approaches is typically of order $O(2^p)$. Most algorithm can be used in practice for values of $p$ in the mid teens. Typically these algorithms are linear in $n$ (the number of observations) so the number of observation isn't an issue. A big advantage is that most of these algorithms are embarrassingly parallel. More recently, many approaches specifically designed for higher dimensional data have been proposed.

Given that you did not specify $p$ in your question, I will list some references for the case $p<20$. Here are some papers that explain this in greater details in these series of review articles:

Rousseeuw, P. J. and van Zomeren B.C. (1990). Unmasking Multivariate Outliers and Leverage Points. Journal of the American Statistical Association, Vol. 85, No. 411, pp. 633-639.

Rousseeuw, P.J. and Van Driessen, K. (2006). Computing LTS Regression for Large Data Sets. Data Mining and Knowledge Discovery archive Volume 12 Issue 1, Pages 29 - 45.

Hubert, M., Rousseeuw, P.J. and Van Aelst, S. (2008). High-Breakdown Robust Multivariate Methods. Statistical Science, Vol. 23, No. 1, 92–119

Ellis S. P. and Morgenthaler S. (1992). Leverage and Breakdown in L1 Regression. Journal of the American Statistical Association, Vol. 87, No. 417, pp. 143-148

A recent reference book on the problem of outlier identification is:

Maronna R. A., Martin R. D. and Yohai V. J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.

These (and many other variations of these) methods are implemented (among other) in the robustbase R package.

Related Question