Solved – Downweight outliers in mean

outliers, robust, trimmed-mean, weighted mean, winsorizing

I have a set of points $x_i$ and would like to calculate a kind of weighted mean that deemphasizes outliers. My first idea was to weight each point by $1/(x_i - \mu)^2$. However, the problem is that the weights already depend on the mean $\mu$. I could do this iteratively (calculate the mean, calculate new weights, repeat), starting with all weights = 1 and stopping when the weighted mean no longer changes much.

The other problem is that this diverges if one of the points is too close to the mean. One way to fix this is to pick a function that is monotonically increasing, equals 0 at 0, and approaches 1 as $x \rightarrow \infty$, such as tanh. So my weight would be $\tanh(1/(x_i - \mu)^2)$. I tried this and it seems to converge and to disregard outliers well enough. But this seems very hacked together, and I thought this is probably already a solved problem.
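For concreteness, the iteration described above could be sketched as follows. This is only an illustration of the idea for one-dimensional data; the function name `tanh_weighted_mean` and the parameters `tol` and `max_iter` are hypothetical, and no convergence guarantee is implied.

```python
import math


def tanh_weighted_mean(xs, tol=1e-9, max_iter=100):
    """Iteratively reweighted mean with weights tanh(1 / (x - mu)^2).

    Hypothetical helper for illustration; tol and max_iter are assumed
    stopping parameters, not part of any standard method.
    """
    mu = sum(xs) / len(xs)  # equal weights on the first pass
    for _ in range(max_iter):
        weights = []
        for x in xs:
            d2 = (x - mu) ** 2
            # tanh(1/d2) tends to 1 as d2 tends to 0, so use that limit
            # instead of dividing by zero.
            weights.append(1.0 if d2 == 0.0 else math.tanh(1.0 / d2))
        total = sum(weights)
        if total == 0.0:
            # Every weight underflowed to zero; keep the current estimate.
            break
        new_mu = sum(w * x for w, x in zip(weights, xs)) / total
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


estimate = tanh_weighted_mean([1, 2, 3, 4, 50])
```

For two-dimensional points, the same loop would use the squared Euclidean distance to the current midpoint in place of `(x - mu) ** 2`. Note that these weights depend on the units of the data: rescaling all points changes the weights, not just the result.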

So: What is the canonical way to calculate such an outlier-deemphasizing weighted mean? Is there a technique that is not iterative? Or, if I have to iterate, is there a technique that is guaranteed to converge (for reasonably well-behaved input data)?

I've seen some people suggest truncated means in similar cases. This doesn't work for me, as I only have a few data points (on the order of 10 per set of points). Also, I don't necessarily know the scale or the typical standard deviation. Sometimes a deviation of 10 is normal, sometimes one of 0.1. The solution should be reasonably scale-independent.

If it matters, I currently have two-dimensional data points and use the Euclidean distance from the current midpoint as the distance measure in the above calculations.

Best Answer

Collecting the comments that amount to an answer, several methods can be used here.

1. Trimmed mean (by @Bernhard)

Calculates the average of the data that lie between two chosen percentiles (for example, the 5th and the 95th), effectively discarding the extreme values. https://en.wikipedia.org/wiki/Truncated_mean
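A minimal sketch of a trimmed mean, assuming the cut is expressed as a proportion of points dropped from each end (the name `trimmed_mean` and the `proportion` parameter are illustrative choices, not taken from the answer):

```python
def trimmed_mean(xs, proportion=0.05):
    """Mean after dropping int(n * proportion) points from each end.

    Hypothetical helper; `proportion` is the fraction cut per tail.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    data = sorted(xs)
    k = int(len(data) * proportion)  # number of points dropped per tail
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


sample = [1, 2, 3, 4, 50]
trimmed_20 = trimmed_mean(sample, proportion=0.2)   # drops 1 and 50
trimmed_05 = trimmed_mean(sample, proportion=0.05)  # drops nothing for n = 5
```

With only a handful of points, a 5% cut rounds down to zero points per tail, so the result is just the ordinary mean; a larger proportion is needed before anything is discarded.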

2. Winsorized mean (by @kjetil b halvorsen)

Sets the bottom 5% of the values to the 5th percentile and the top 5% to the 95th percentile (the percentages are a choice), then calculates the average of the result. https://en.wikipedia.org/wiki/Winsorizing
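A comparable sketch for the winsorized mean, again assuming the cut is given as a proportion per tail and rounded down to a whole number of points (`winsorized_mean` and `proportion` are hypothetical names):

```python
def winsorized_mean(xs, proportion=0.05):
    """Mean after clamping the k smallest and k largest points.

    Hypothetical helper; k = int(n * proportion) points per tail are
    replaced by the nearest remaining value rather than dropped.
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    data = sorted(xs)
    n = len(data)
    k = int(n * proportion)
    if k > 0:
        low, high = data[k], data[n - k - 1]
        # Clamp every value into [low, high]; interior points are unchanged.
        data = [min(max(x, low), high) for x in data]
    return sum(data) / n


winsorized_20 = winsorized_mean([1, 2, 3, 4, 50], proportion=0.2)
```

Unlike trimming, the sample size stays the same: the extreme points are pulled in to the nearest retained value instead of being removed.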

3. M-estimator (by @Michael M)

I'm sorry, I can't provide a concise explanation; see https://en.wikipedia.org/wiki/M-estimator
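As one illustration of the M-estimator family, the sketch below computes a Huber-type location estimate by iteratively reweighted averaging. It is an assumption-laden example, not the method the commenter necessarily had in mind: the tuning constant `c = 1.345` and the MAD-based scale with factor 1.4826 are common conventional choices, and the function name `huber_location` is hypothetical.

```python
import statistics


def huber_location(xs, c=1.345, tol=1e-9, max_iter=100):
    """Huber-type M-estimate of location for 1-D data (illustrative sketch).

    Assumptions: scale is fixed at 1.4826 * MAD, c = 1.345 is the
    conventional tuning constant, and the iteration starts at the median.
    """
    xs = list(xs)
    mu = statistics.median(xs)
    mad = statistics.median([abs(x - mu) for x in xs])
    scale = 1.4826 * mad
    if scale == 0.0:
        # More than half of the points coincide; fall back to the median.
        return mu
    cutoff = c * scale
    for _ in range(max_iter):
        # Weight 1 inside the cutoff, cutoff / |residual| outside it, so
        # the weights stay bounded and distant points are downweighted.
        weights = [
            1.0 if abs(x - mu) <= cutoff else cutoff / abs(x - mu)
            for x in xs
        ]
        new_mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


robust_center = huber_location([1, 2, 3, 4, 50])
```

Because the cutoff is derived from the data's own spread, multiplying all points by a constant multiplies the estimate by the same constant, which addresses the scale concern raised in the question. For a fixed scale, the Huber loss is convex, which is why this kind of reweighting is usually well behaved.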

4. E-M algorithm (by @Tim)

https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm

5. Median (by @Tim)

The median is less affected by outliers and is more robust than the mean. Consider the set of numbers $\{1, 2, 3, 4, 5\}$, with mean 3 and median 3. If I were to change 5 to 50, the mean changes to 12, but the median stays 3.

https://en.wikipedia.org/wiki/Median
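The arithmetic in the example can be checked with Python's standard `statistics` module (the variable names are just for illustration):

```python
import statistics

original = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]  # 5 replaced by 50

mean_before = statistics.mean(original)          # 15 / 5 = 3
median_before = statistics.median(original)      # middle value: 3
mean_after = statistics.mean(with_outlier)       # 60 / 5 = 12
median_after = statistics.median(with_outlier)   # middle value is still 3
```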
