Solved – Robust mean estimation with O(1) update efficiency

Tags: estimation, mean, robust

I am looking for a robust estimator of the mean that has a specific property. I have a set of elements for which I want to calculate this statistic. Then I add new elements one at a time, and for each additional element I would like to recalculate the statistic (also known as an online algorithm). I would like this update calculation to be fast, preferably O(1), i.e. not dependent on the size of the list.

The usual mean has this property that it can be updated efficiently, but it is not robust to outliers. Typical robust estimators of the mean, like the interquartile mean and the trimmed mean, cannot be updated efficiently, since they require maintaining a sorted list.

I would appreciate any suggestions for robust statistics that can be calculated/updated efficiently.

Best Answer

You might think of relating your problem to that of the recursive control chart. Such a control chart will evaluate whether a new observation is in control. If it is, this observation is included in the new estimate of the mean and variance (necessary to determine control limits).

Some background on robust, recursive, univariate control charts can be found here. One of the classic texts on quality control and control charts appears to be available online here.

Intuitively, using the mean $\mu_{t-1}$ and the variance $\sigma^2_{t-1}$ as inputs, you can determine whether a new observation at time $t$ is an outlier by a number of approaches. One would be to declare $x_t$ an outlier if it lies more than a certain number of standard deviations away from $\mu_{t-1}$ (given $\sigma^2_{t-1}$), but this may run into problems if the data do not conform to certain distributional assumptions. If you want to go down this road, suppose you have determined that a new point is not an outlier and would like to include it in your mean estimate with no special rate of forgetting. Then you can't do better than:

$\mu_t = \frac{t-1}{t}\mu_{t-1}+\frac{1}{t}x_t$

Similarly, you will need to update the variance recursively:

$\sigma^2_t = \frac{t-1}{t}\sigma^2_{t-1}+\frac{1}{t-1}(x_t-\mu_t)^2$
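To make this concrete, here is a minimal Python sketch of the screen-then-update idea: reject the new point if it is more than $k$ standard deviations from the current mean, otherwise apply the two recursions above. The class name, the default threshold $k = 3$, and the requirement of at least two training points are my illustrative choices, not part of the answer itself.

```python
import math


class ScreenedOnlineMean:
    """Recursive mean/variance estimates with a simple outlier screen (a sketch)."""

    def __init__(self, mu0, sigma2_0, n0, k=3.0):
        # mu0, sigma2_0: initial (ideally robust) estimates from n0 >= 2 training points
        self.mu = mu0
        self.sigma2 = sigma2_0
        self.t = n0          # number of observations included so far
        self.k = k           # screening threshold, in standard deviations

    def update(self, x):
        """Fold x into the estimates unless it is flagged as an outlier."""
        if abs(x - self.mu) > self.k * math.sqrt(self.sigma2):
            return False     # out of control: leave the estimates untouched
        self.t += 1
        t = self.t
        # mu_t = ((t-1)/t) * mu_{t-1} + (1/t) * x_t
        self.mu = (t - 1) / t * self.mu + x / t
        # sigma^2_t = ((t-1)/t) * sigma^2_{t-1} + (1/(t-1)) * (x_t - mu_t)^2
        self.sigma2 = (t - 1) / t * self.sigma2 + (x - self.mu) ** 2 / (t - 1)
        return True
```

Each `update` call is O(1): one comparison against the current control limit plus a constant number of arithmetic operations, regardless of how many observations have been seen.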

However, you might want to try some more conventional control charts. Other control charts that are more robust to the distribution of the data and can still handle non-stationarity (such as the $\mu$ of your process slowly drifting higher), like the EWMA or CUSUM, are recommended (see the textbook linked above for more details on these charts and their control limits). These methods will typically be less computationally intensive than a robust estimator because they only need to compare a single new observation to information derived from the non-outlier observations. You can refine the estimates of the long-run process $\mu$ and $\sigma^2$ used in the control-limit calculations of these methods with the updating formulas given above if you like.

Regarding a chart like the EWMA, which forgets old observations and gives more weight to new ones: if you think your data are stationary (meaning the parameters of the generating distribution do not change), there is no need to forget older observations exponentially, and you can set the forgetting factor accordingly. However, if you think the process is non-stationary, you will need to select a good value for the forgetting factor (again, see the textbook for a way to do this).
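
As an illustration of the EWMA chart mentioned above, here is a sketch using the standard textbook formulation $z_t = \lambda x_t + (1-\lambda) z_{t-1}$ with exact time-varying control limits. The defaults $\lambda = 0.2$ and limit width $L = 3$ are conventional illustrative values, and `mu0`, `sigma0` are the (robust) training-sample estimates discussed below; none of these specifics come from the answer itself.

```python
import math


def ewma_chart(xs, mu0, sigma0, lam=0.2, L=3.0):
    """EWMA chart: z_t = lam * x_t + (1 - lam) * z_{t-1}, started at z_0 = mu0.

    Yields (z_t, in_control) for each observation. A lam near 1 forgets old
    observations quickly (useful for drifting processes); a lam near 0
    averages over a long history (fine for stationary data).
    """
    z = mu0
    for t, x in enumerate(xs, start=1):
        z = lam * x + (1.0 - lam) * z
        # exact time-varying limits; they widen toward the asymptotic
        # half-width L * sigma0 * sqrt(lam / (2 - lam))
        half_width = L * sigma0 * math.sqrt(
            lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t))
        )
        yield z, abs(z - mu0) <= half_width
```

Observations whose EWMA statistic stays inside the limits can then be fed back into the recursive updating formulas above to refine the long-run $\mu$ and $\sigma^2$.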

I should also mention that before you begin monitoring and adding new observations online, you will need to obtain estimates of $\mu_0$ and $\sigma^2_0$ (the initial parameter values based on a training dataset) that are not influenced by outliers. If you suspect there are outliers in your training data, you can pay the one-time cost of using a robust method to estimate them.
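
One common way to do that one-time robust initialization, sketched below, is the median for location and the scaled MAD for spread; this particular pair (and the consistency factor 1.4826) is my suggestion for illustration, not something specified above.

```python
import statistics


def robust_init(training_data):
    """One-time robust estimates of mu_0 and sigma^2_0 from a training sample."""
    mu0 = statistics.median(training_data)
    # MAD scaled by 1.4826 is consistent with the standard deviation under normality
    mad = statistics.median([abs(x - mu0) for x in training_data])
    sigma0 = 1.4826 * mad
    return mu0, sigma0 ** 2
```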

I think an approach along these lines will lead to the fastest updating for your problem.
