Robust Mean Estimation – Crash Course

Tags: mean, outliers, references, robust

I have a bunch (around 1000) of estimates, and they are all supposed to be estimates of the same long-run elasticity. A little more than half of them were estimated using method A and the rest using method B. Somewhere I read something like "I think method B estimates something very different than method A, because its estimates are much (50-60%) higher." My knowledge of robust statistics is next to nothing, so I only calculated the sample means and medians of both samples… and I immediately saw the difference. Method A is very concentrated, with little difference between its median and mean, but the method B sample varies wildly.

I concluded that outliers and measurement errors were skewing the method B sample, so I threw away about 50 values (about 15%) that were very inconsistent with theory… and suddenly the means of both samples (including their CIs) were very similar. The density plots matched as well.

(In my quest to eliminate outliers, I looked at the range of sample A and removed all points in sample B that fell outside it.) Could you point me to some basics of robust estimation of means that would let me judge this situation more rigorously, and give me some references? I do not need a very deep understanding of the various techniques; I would rather read through a comprehensive survey of the methodology of robust estimation.

I ran a t-test for the significance of the mean difference after removing the outliers: the p-value is 0.0559 (t around 1.9), whereas for the full samples the t statistic was around 4.5. But that is not really the point; the means may well differ a bit, but they should not differ by 50-60% as claimed above. And I don't think they do.
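(For reference, here is a minimal sketch in Python of the kind of comparison described above. The data are simulated placeholders standing in for the two sets of elasticity estimates, and the range filter simply mirrors what I did by hand; the trimmed t-test at the end is one more principled alternative to ad-hoc deletion, not necessarily what I should have done.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder data standing in for the ~1000 elasticity estimates
# (assumption: real estimates would be loaded from a file instead).
a = rng.normal(0.5, 0.1, 550)                   # method A: concentrated
b = np.concatenate([rng.normal(0.5, 0.1, 380),  # method B: same center...
                    rng.normal(3.0, 1.0, 70)])  # ...plus wild outliers

# Naive comparison on the full samples (Welch's t-test).
print(stats.ttest_ind(a, b, equal_var=False))

# Ad-hoc filter from the question: drop points of B outside A's range.
b_filtered = b[(b >= a.min()) & (b <= a.max())]
print(stats.ttest_ind(a, b_filtered, equal_var=False))

# A more principled alternative: Yuen's trimmed t-test (SciPy >= 1.7),
# which trims 15% from each tail instead of deleting points by hand.
print(stats.ttest_ind(a, b, equal_var=False, trim=0.15))
```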

Best Answer

Are you looking for the theory, or something practical?

If you are looking for books, here are some that I found helpful:

  • F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, W.A. Stahel, Robust Statistics: The Approach Based on Influence Functions, John Wiley & Sons, 1986.

  • P.J. Huber, Robust Statistics, John Wiley & Sons, 1981.

  • P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, John Wiley & Sons, 1987.

  • R.G. Staudte, S.J. Sheather, Robust Estimation and Testing, John Wiley & Sons, 1990.

If you are looking for practical methods, here are a few robust methods of estimating the mean ("estimators of location" is, I guess, the more principled term); a short code sketch follows the list:

  • The median is simple, well-known, and pretty powerful. It has excellent robustness to outliers. The "price" of robustness is about 25%.

  • The 5%-trimmed mean is another possible method. Here you throw away the 5% highest and the 5% lowest values, and then take the mean (average) of what remains. This is less robust to outliers: as long as no more than 5% of your data points are corrupted it behaves well, but if more than 5% are corrupted it can suddenly turn awful (it does not degrade gracefully). Its "price" of robustness is less than the median's, though I don't know exactly what it is.

  • The Hodges-Lehmann estimator computes the median of the set $\{(x_i+x_j)/2 : 1 \le i \le j \le n\}$ (a set containing $n(n+1)/2$ values), where $x_1,\dots,x_n$ are the observations. This has very good robustness: it can handle corruption of up to about 29% of the data points without totally falling apart. And the "price" of robustness is low: about 5%. It is a plausible alternative to the median.

  • The midhinge, the average of the first and third quartiles, is another estimator that is sometimes used, and it is especially simple to compute. (Despite the similar name, it differs from the interquartile mean, which averages the observations lying between the two quartiles.) It has good robustness: it can tolerate corruption of up to 25% of the data points. However, that breakdown point is only half the median's, and its "price" of robustness is still non-trivial, so on balance it seems inferior to the median.

  • There are many other measures that have been proposed, but the ones above seem reasonable.
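Here is a minimal sketch of these four estimators in Python (NumPy/SciPy). The data array `x` is a placeholder, and the Hodges-Lehmann implementation is a direct O(n²) translation of the definition above: fine for ~1000 points, but not for huge samples.

```python
import numpy as np
from scipy import stats

def hodges_lehmann(x):
    """Median of all pairwise averages (x_i + x_j)/2 with i <= j."""
    x = np.asarray(x)
    i, j = np.triu_indices(len(x))   # pairs with i <= j, including i == j
    return np.median((x[i] + x[j]) / 2)

def midhinge(x):
    """Average of the first and third quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    return (q1 + q3) / 2

# Placeholder data: mostly Gaussian, with 5% gross outliers mixed in.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.5, 0.1, 950), rng.normal(5.0, 1.0, 50)])

print("mean:          ", np.mean(x))                # dragged up by outliers
print("median:        ", np.median(x))
print("5%-trimmed:    ", stats.trim_mean(x, 0.05))  # trims 5% from each tail
print("Hodges-Lehmann:", hodges_lehmann(x))
print("midhinge:      ", midhinge(x))
```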

In short, I would suggest the median or possibly the Hodges-Lehmann estimator.

P.S. Oh, I should explain what I mean by the "price" of robustness. A robust estimator is designed to still work decently well even if some of your data points have been corrupted or are otherwise outliers. But what if you use a robust estimator on a data set that has no outliers and no corruption? Ideally, we'd like the robust estimator to make as efficient use of the data as possible. Here we can measure efficiency by the standard error (intuitively, the typical amount of error in the estimate produced by the estimator).

It is known that if your observations come from a Gaussian distribution (iid), and you know you won't need robustness, then the mean is optimal: it has the smallest possible estimation error. The "price" of robustness, above, is how much the standard error increases if we apply a particular robust estimator to this situation. A price of robustness of 25% for the median means that the typical estimation error with the median will be about 25% larger than the typical estimation error with the mean. Obviously, the lower the price, the better.
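A quick simulation sketch makes this concrete (assumed setup: clean iid standard Gaussian samples, no outliers): the median's standard error comes out roughly 25% above the mean's, matching the figure quoted above.

```python
import numpy as np

# Estimate the "price" of robustness under clean iid Gaussian data:
# compare the spread of the mean vs. the median over many repetitions.
rng = np.random.default_rng(2)
n, reps = 1000, 20_000
samples = rng.standard_normal((reps, n))

se_mean = samples.mean(axis=1).std()
se_median = np.median(samples, axis=1).std()

print(f"SE of mean:   {se_mean:.5f}")
print(f"SE of median: {se_median:.5f}")
print(f"price of robustness: {se_median / se_mean - 1:.1%}")  # ~25%
```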
