If your data contains a single outlier, then it can be found reliably using the approach you suggest (without the iterations, though). A formal approach to this is
Cook, R. Dennis (1979). Influential Observations in Linear Regression. Journal of the American Statistical Association, Vol. 74, No. 365, pp. 169-174.
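For concreteness, here is a minimal R sketch of that idea using Cook's distance. The data frame `dat` and variables `y`, `x` are placeholders, and the $4/n$ cutoff is a common rule of thumb rather than something prescribed by Cook (1979):

```r
## Single-outlier screening via Cook's distance (illustrative names).
fit <- lm(y ~ x, data = dat)
d   <- cooks.distance(fit)

## Rough convention: flag observations with D_i > 4/n.
which(d > 4 / nrow(dat))
```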
For finding more than one outlier, the leading method for many years was the so-called $M$-estimation family of approaches. This is a rather broad family of estimators that includes Huber's $M$ estimator of regression, Koenker's $l_1$ regression, as well as the approach proposed by Procastinator in his comment to your question.
The $M$ estimators with convex $\rho$ functions have the advantage that they have about the same numerical complexity as a regular regression estimation. The big disadvantage is that they can only reliably find the outliers if:
- the contamination rate of your sample is smaller than $\frac{1}{1+p}$ where $p$ is the number of design variables,
- or if the outliers are not outlying in the design space (Ellis and Morgenthaler (1992)).
You can find good implementations of $M$ ($l_1$) estimators of regression in the robustbase (quantreg) R packages.
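As a hedged sketch (one possible choice of functions, not the only one), Huber-type $M$-estimation and $l_1$ regression could look like this in R; `dat`, `y`, and `x` are placeholder names and the 2.5 cutoff is just a convention:

```r
library(MASS)      # rlm(): M-estimation, Huber psi by default
library(quantreg)  # rq(): quantile regression; tau = 0.5 gives L1 regression

fit_huber <- rlm(y ~ x, data = dat)            # convex rho (Huber)
fit_l1    <- rq(y ~ x, tau = 0.5, data = dat)  # least absolute deviations

## Observations with large standardized robust residuals are outlier candidates.
which(abs(residuals(fit_huber) / fit_huber$s) > 2.5)
```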
If your data contains more than $\lfloor\frac{n}{p+1}\rfloor$ outliers, potentially also outlying in the design space, then finding them amounts to solving a combinatorial problem (equivalently, the solution to an $M$ estimator with a redescending/non-convex $\rho$ function).
In the last 20 years (and especially the last 10), a large body of fast and reliable outlier detection algorithms has been developed to approximately solve this combinatorial problem. These are now widely implemented in the most popular statistical packages (R, Matlab, SAS, STATA, ...).
Nonetheless, the numerical complexity of finding outliers with these approaches is typically of order $O(2^p)$. Most algorithms can be used in practice for values of $p$ in the mid teens. These algorithms are typically linear in $n$ (the number of observations), so the number of observations isn't an issue. A big advantage is that most of these algorithms are embarrassingly parallel. More recently, many approaches specifically designed for higher-dimensional data have been proposed.
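As a rough sketch of such high-breakdown fits in R (placeholder names `dat`, `y`, `x`; the 2.5 cutoff is again only a convention), one could use the robustbase package:

```r
library(robustbase)

fit_lts <- ltsReg(y ~ x, data = dat)  # FAST-LTS (Rousseeuw & Van Driessen, 2006)
fit_mm  <- lmrob(y ~ x, data = dat)   # MM-estimator with a redescending rho

## Flag observations whose residuals are large relative to the robust scale.
which(abs(residuals(fit_lts)) / fit_lts$scale > 2.5)
summary(fit_mm)  # robust coefficients and per-observation robustness weights
```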
Given that you did not specify $p$ in your question, I will list some references for the case $p<20$. Here are some papers and review articles that explain this in greater detail:
Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking Multivariate Outliers and Leverage Points. Journal of the American Statistical Association, Vol. 85, No. 411, pp. 633-639.
Rousseeuw, P. J. and Van Driessen, K. (2006). Computing LTS Regression for Large Data Sets. Data Mining and Knowledge Discovery, Vol. 12, No. 1, pp. 29-45.
Hubert, M., Rousseeuw, P. J. and Van Aelst, S. (2008). High-Breakdown Robust Multivariate Methods. Statistical Science, Vol. 23, No. 1, pp. 92-119.
Ellis, S. P. and Morgenthaler, S. (1992). Leverage and Breakdown in L1 Regression. Journal of the American Statistical Association, Vol. 87, No. 417, pp. 143-148.
A recent reference book on the problem of outlier identification is:
Maronna, R. A., Martin, R. D. and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.
These methods (and many other variations of them) are implemented, among others, in the robustbase R package.
I break your concerns about the estimator into two areas: efficiency and asymptotic validity. I'll define a procedure as asymptotically valid if the point estimates are consistent and the estimated variance-covariance matrix is consistent. An extension of Alecos's argument shows that robust (i.e., sandwich) standard errors yield asymptotic validity regardless of the assumed weighting matrix, and in fact this result even holds for clustered/correlated data (as long as independence holds at the uppermost level of clustering).
I'll define the efficiency of the estimate as the true asymptotic variance/covariance matrix of the coefficients. Of course, from Gauss-Markov we know that only when you select weights proportional to the inverse conditional variance of each observation will you achieve the smallest limiting variance among unbiased estimators.$^1$ So based on first-order, asymptotic concerns, we may just take the best stab we can at estimating the weights, then use robust standard errors to guard against mistakes in the weights.
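A minimal R sketch of that "best-guess weights plus sandwich standard errors" strategy, assuming placeholder names `dat`, `y`, `x`, and a weight column `w` holding whatever estimate of the inverse conditional variance you have:

```r
library(sandwich)  # vcovHC(): heteroskedasticity-consistent covariance
library(lmtest)    # coeftest(): tests using a supplied covariance matrix

fit <- lm(y ~ x, data = dat, weights = w)

## Point estimates come from the weighted fit; inference uses sandwich SEs,
## so mis-specified weights cost efficiency but not asymptotic validity.
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))
```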
To say anything more refined than this, we need to think about second-order asymptotic or finite-sample concerns. An example of a second-order concern might be the "variance of the variance." While I don't have the inclination to try to make Alecos's argument rigorous, I believe it does hold: when you estimate additional, unnecessary parameters, you introduce additional variance in the remaining parameters. (You might be able to make it rigorous by considering Schur decompositions of blocks of the information matrix?) So there is probably a second-order bias-variance tradeoff present: when you use the robust standard errors, you eliminate bias in the standard errors, at the cost of perhaps more variance in them.
Most people seem to care more about the bias than the variance, but if this tradeoff is important, then the only advice I have to offer is to simulate or bootstrap to see how much it might matter in your application. There's probably some additional theory, extant or yet to be developed, that could offer advice via higher-order asymptotics, but that's beyond my pay grade.
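A rough sketch of that simulation advice: compare how variable the model-based and sandwich standard errors are across replications. The data-generating process, sample size, and number of replications below are entirely made up for illustration.

```r
library(sandwich)

set.seed(1)
one_rep <- function(n = 200) {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n, sd = 1 + abs(x))  # heteroskedastic errors
  fit <- lm(y ~ x)
  c(model    = sqrt(vcov(fit)[2, 2]),
    sandwich = sqrt(vcovHC(fit, type = "HC1")[2, 2]))
}

ses <- replicate(1000, one_rep())
rowMeans(ses)      # average size of each SE estimator
apply(ses, 1, sd)  # "variance of the variance": how noisy each SE is
```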
$^1$ Proof here, apparently originally due to Aitken.
Best Answer
To be somewhat nitpicky, I would not quite say that outliers, heteroscedasticity, and non-normality don't matter with robust regression methods. Rather, I would say that robust methods are less likely to be impaired or harmed by those conditions. However, they could still have a negative effect.
The issue of whether the significance of the coefficients or the accuracy of their estimation is what's important is really unrelated to robust regression. Which of those is more important to you depends on the questions you are trying to answer, not on what tools you use to try to answer them. For instance, consider a case where you want to test the hypothesis that a given variable is unrelated to the response variable. You wouldn't want the answer you get to that question (either yes or no) to be driven by an outlier. So you would use robust methods to help ensure that your answer is representative of the bulk of your data. Likewise, consider a case where you want to know the slope of the relationship between a predictor variable and the response variable as accurately as possible. You wouldn't want the estimated slope value that you get to have been driven by an outlier. So you would use robust regression to protect against that possibility. In short, robust methods diminish the extent to which your results might be influenced by violations of the classical statistical assumptions.
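A toy illustration of that point, with entirely made-up data: a single gross outlier pulls the OLS slope, while an M-estimation fit stays close to the bulk of the data.

```r
library(MASS)

set.seed(42)
x <- 1:20
y <- 2 * x + rnorm(20)
y[20] <- -40       # one gross outlier

coef(lm(y ~ x))    # OLS slope pulled toward the outlier
coef(rlm(y ~ x))   # robust (M-estimation) slope stays near 2
```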
I recognize your frustration that you did not get any significant results when you used these methods. There are a couple of possibilities here. It may be that what appeared to be the case prior to using robust regression (perhaps the results from a prior OLS regression analysis) was driven by violations of the OLS assumptions, and that the null hypothesis is actually true. The other possibility is that, when the OLS assumptions do hold, standard methods will have more power than robust methods.