Solved – Robust mean estimation with O(1) update efficiency

Tags: estimation, mean, robust

I am looking for a robust estimator of the mean that has a specific property. I have a set of elements for which I want to calculate this statistic. Then I add new elements one at a time, and for each additional element I would like to recalculate the statistic (also known as an online algorithm). I would like this update calculation to be fast, preferably O(1), i.e. not dependent on the size of the list.

The usual mean has this property that it can be updated efficiently, but it is not robust to outliers. Typical robust estimators of the mean, like the interquartile mean and the trimmed mean, cannot be updated efficiently, since they require maintaining a sorted list.

I would appreciate any suggestions for robust statistics that can be calculated/updated efficiently.

Best Answer

You might think of relating your problem to that of the recursive control chart. Such a control chart will evaluate whether a new observation is in control. If it is, this observation is included in the new estimate of the mean and variance (necessary to determine control limits).

Some background on robust, recursive, univariate control charts can be found here. One of the classic texts on quality control and control charts appears to be available online here.

Intuitively, using the mean $\mu_{t-1}$ and the variance $\sigma^2_{t-1}$ as inputs, you can determine whether a new observation at time $t$ is an outlier by a number of approaches. One would be to declare $x_t$ an outlier if it lies more than a certain number of standard deviations away from $\mu_{t-1}$ (given $\sigma^2_{t-1}$), but this may run into problems if the data do not conform to certain distributional assumptions. If you want to go down this road, suppose you have determined that a new point is not an outlier and would like to include it in your mean estimate with no special rate of forgetting. Then you can't do better than:

$\mu_t = \frac{t-1}{t}\mu_{t-1}+\frac{1}{t}x_t$

Similarly, you will need to update the variance recursively:

$\sigma^2_t = \frac{t-1}{t}\sigma^2_{t-1}+\frac{1}{t-1}(x_t-\mu_t)^2$
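To make this concrete, here is a minimal Python sketch of the screen-then-update idea: reject the new point if it is more than $k$ standard deviations from the current mean, otherwise apply the two recursions above. The class name, the default threshold $k = 3$, and the requirement of at least two training points are my illustrative choices, not part of the answer itself.

```python
import math


class ScreenedOnlineMean:
    """Recursive mean/variance estimates with a simple outlier screen (a sketch)."""

    def __init__(self, mu0, sigma2_0, n0, k=3.0):
        # mu0, sigma2_0: initial (ideally robust) estimates from n0 >= 2 training points
        self.mu = mu0
        self.sigma2 = sigma2_0
        self.t = n0          # number of observations included so far
        self.k = k           # screening threshold, in standard deviations

    def update(self, x):
        """Fold x into the estimates unless it is flagged as an outlier."""
        if abs(x - self.mu) > self.k * math.sqrt(self.sigma2):
            return False     # out of control: leave the estimates untouched
        self.t += 1
        t = self.t
        # mu_t = ((t-1)/t) * mu_{t-1} + (1/t) * x_t
        self.mu = (t - 1) / t * self.mu + x / t
        # sigma^2_t = ((t-1)/t) * sigma^2_{t-1} + (1/(t-1)) * (x_t - mu_t)^2
        self.sigma2 = (t - 1) / t * self.sigma2 + (x - self.mu) ** 2 / (t - 1)
        return True
```

Each `update` call is O(1): one comparison against the current control limit plus a constant number of arithmetic operations, regardless of how many observations have been seen.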

However, you might want to try some more conventional control charts. Other control charts that are more robust to the distribution of the data and can still handle non-stationarity (such as the $\mu$ of your process slowly drifting higher), like the EWMA or CUSUM, are recommended (see the textbook linked above for more details on these charts and their control limits). These methods will typically be less computationally intensive than a robust estimator because they only need to compare a single new observation to information derived from the non-outlier observations. You can refine the estimates of the long-run process $\mu$ and $\sigma^2$ used in the control-limit calculations of these methods with the updating formulas given above if you like.

Regarding a chart like the EWMA, which forgets old observations and gives more weight to new ones: if you think your data are stationary (meaning the parameters of the generating distribution do not change), there is no need to forget older observations exponentially, and you can set the forgetting factor accordingly. However, if you think the process is non-stationary, you will need to select a good value for the forgetting factor (again, see the textbook for a way to do this).
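
As an illustration of the EWMA chart mentioned above, here is a sketch using the standard textbook formulation $z_t = \lambda x_t + (1-\lambda) z_{t-1}$ with exact time-varying control limits. The defaults $\lambda = 0.2$ and limit width $L = 3$ are conventional illustrative values, and `mu0`, `sigma0` are the (robust) training-sample estimates discussed below; none of these specifics come from the answer itself.

```python
import math


def ewma_chart(xs, mu0, sigma0, lam=0.2, L=3.0):
    """EWMA chart: z_t = lam * x_t + (1 - lam) * z_{t-1}, started at z_0 = mu0.

    Yields (z_t, in_control) for each observation. A lam near 1 forgets old
    observations quickly (useful for drifting processes); a lam near 0
    averages over a long history (fine for stationary data).
    """
    z = mu0
    for t, x in enumerate(xs, start=1):
        z = lam * x + (1.0 - lam) * z
        # exact time-varying limits; they widen toward the asymptotic
        # half-width L * sigma0 * sqrt(lam / (2 - lam))
        half_width = L * sigma0 * math.sqrt(
            lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t))
        )
        yield z, abs(z - mu0) <= half_width
```

Observations whose EWMA statistic stays inside the limits can then be fed back into the recursive updating formulas above to refine the long-run $\mu$ and $\sigma^2$.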

I should also mention that before you begin monitoring and adding new observations online, you will need to obtain estimates of $\mu_0$ and $\sigma^2_0$ (the initial parameter values based on a training dataset) that are not influenced by outliers. If you suspect there are outliers in your training data, you can pay the one-time cost of using a robust method to estimate them.
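
One common way to do that one-time robust initialization, sketched below, is the median for location and the scaled MAD for spread; this particular pair (and the consistency factor 1.4826) is my suggestion for illustration, not something specified above.

```python
import statistics


def robust_init(training_data):
    """One-time robust estimates of mu_0 and sigma^2_0 from a training sample."""
    mu0 = statistics.median(training_data)
    # MAD scaled by 1.4826 is consistent with the standard deviation under normality
    mad = statistics.median([abs(x - mu0) for x in training_data])
    sigma0 = 1.4826 * mad
    return mu0, sigma0 ** 2
```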

I think an approach along these lines will lead to the fastest updating for your problem.
