Solved – Downweight outliers in mean

Tags: outliers, robust, trimmed-mean, weighted mean, winsorizing

I have a bunch of points $x_i$ and would like to calculate a kind of weighted mean that deemphasizes outliers. My first idea was to weight each point by $1/(x_i - \mu)^2$. However, the problem is that the weights depend on the mean $\mu$, which is the quantity I am trying to compute. I could do this iteratively (calculate the mean, calculate new weights, repeat), starting with all weights equal to 1 and stopping when the weighted mean no longer changes much.

The other problem is that this diverges if one of the points is too close to the mean. One way to fix this is to pass the weight through a function that is monotonically increasing, equals 0 at 0, and approaches 1 as $x \rightarrow \infty$, such as tanh. So my weight would be $\tanh(1/(x_i - \mu)^2)$. I tried this and it seems to converge and to disregard outliers well enough. But it seems very hacked together, and I thought this is probably already a solved problem.

So: What is the canonical way to calculate such an outlier-deemphasizing weighted mean? Is there a technique that is not iterative? Or, if I have to iterate, is there a technique that is guaranteed to converge (for reasonably well-behaved input data)?

I've seen some people suggest truncated means in similar cases. This doesn't work for me, as I only have a few data points (on the order of 10 per set of points). Also, I don't necessarily know the scale or the typical standard deviation. Sometimes a deviation of 10 is normal, sometimes one of 0.1. The solution should be reasonably scale-independent.

If it matters, I currently have two-dimensional data points and use the Euclidean distance from the current midpoint as the distance measure in the above calculations.

Best Answer

Rounding up the comments that have the value of an answer, several methods can be used here.

1. Trimmed mean (by @Bernhard)

Calculates the average of the data lying between two percentiles, for example the 5th and the 95th, effectively discarding the extreme values. https://en.wikipedia.org/wiki/Truncated_mean
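As a minimal sketch of the trimmed mean in R: base R's `mean()` takes a `trim` argument giving the fraction removed from each end. The sample below is hypothetical (nine ordinary values plus one gross outlier), and the 10% trim fraction is an assumption chosen because, with only 10 points, a 5% trim rounds down to removing nothing.

```r
# Hypothetical sample: nine well-behaved values plus one gross outlier.
x <- c(9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.1, 50)

plain   <- mean(x)              # pulled upward by the outlier
trimmed <- mean(x, trim = 0.1)  # drops floor(10 * 0.1) = 1 value from each end

print(c(plain = plain, trimmed = trimmed))
```

With so few points the trim fraction matters a lot, which is the concern raised in the question.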

2. Winsorized mean (by @kjetil b halvorsen)

Sets the bottom 5% of the values to the 5th percentile and the top 5% to the 95th percentile, then calculates the average of the result. https://en.wikipedia.org/wiki/Winsorizing
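A rough sketch of winsorizing in R follows. The helper `winsorized_mean` and its argument `k` are hypothetical names, and instead of percentiles it clamps a fixed count of `k` values at each end, which is an assumption that seemed easier to reason about for samples of about 10 points.

```r
# Hypothetical helper: clamp the k smallest and k largest values to their
# nearest retained neighbours, then average everything.
winsorized_mean <- function(x, k = 1) {
  s <- sort(x)
  n <- length(s)
  stopifnot(2 * k < n)
  lo <- s[k + 1]   # smallest value that is kept as-is
  hi <- s[n - k]   # largest value that is kept as-is
  mean(pmin(pmax(x, lo), hi))
}

# Hypothetical sample: nine well-behaved values plus one gross outlier.
x <- c(9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.1, 50)
w <- winsorized_mean(x, k = 1)
print(w)
```

Unlike trimming, all 10 values still contribute to the average; the extreme ones are just pulled in.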

3. M-estimator (by @Michael M)

I'm sorry, I can't provide a concise explanation. Better see https://en.wikipedia.org/wiki/M-estimator
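To give a flavour of what an M-estimator looks like in practice, here is a sketch of one common choice, a Huber-type location estimate computed by iterative reweighting. This is close in spirit to the reweighting loop described in the question. Everything specific here is an assumption on my part rather than something from the comments: the function name `huber_location`, the tuning constant `k = 1.345`, the median as starting value, and the MAD as a fixed scale estimate (which is what makes the result scale-independent). It handles one-dimensional data only.

```r
# Hypothetical helper: Huber-type M-estimate of location by iterative reweighting.
huber_location <- function(x, k = 1.345, tol = 1e-8, max_iter = 100) {
  mu <- median(x)          # robust starting value
  s  <- mad(x)             # robust scale, held fixed during the iterations
  if (s == 0) return(mu)   # more than half of the values coincide
  for (i in seq_len(max_iter)) {
    r <- abs(x - mu) / s           # scaled distances from the current centre
    w <- pmin(1, k / r)            # weight 1 near the centre, k / r further out
    mu_new <- sum(w * x) / sum(w)  # weighted mean with the current weights
    converged <- abs(mu_new - mu) < tol * s
    mu <- mu_new
    if (converged) break
  }
  mu
}

# Hypothetical sample: nine well-behaved values plus one gross outlier.
x <- c(9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.1, 50)
print(c(mean = mean(x), huber = huber_location(x)))
```

The weights are bounded by 1, so a point sitting exactly on the current centre does not blow up the way $1/(x_i - \mu)^2$ does, and distant points are downweighted rather than discarded.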

4. E-M algorithm (by @Tim)

https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm

5. Median (by @Tim)

The median is less affected by outliers and is more robust than the mean. Consider the set of numbers $\{1, 2, 3, 4, 5\}$, with mean 3 and median 3. If I were to change 5 to 50, the mean would change to 12, but the median would stay at 3.

https://en.wikipedia.org/wiki/Median
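The arithmetic in the median example can be checked directly in R; the variable names `a` and `b` are just for illustration.

```r
a <- c(1, 2, 3, 4, 5)   # original set
b <- c(1, 2, 3, 4, 50)  # same set with 5 replaced by 50

print(c(mean = mean(a), median = median(a)))  # mean 3, median 3
print(c(mean = mean(b), median = median(b)))  # mean 12, median 3
```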

Related Solutions

Outlier Detection – How to Detect Outliers in a Mixture of Gaussians Using Normal Distribution Models

I have suggested, in comments, that an "outlier" in this situation might be defined as a member of a "small" cluster centered at an "extreme" value. The meanings of the quoted terms need to be quantified, but apparently they can be: "small" would be a cluster of fewer than 10 values and "extreme" can be determined as outlying relative to the set of component means in the mixture model. In this case, outliers can be found with simple post-processing of any reasonable cluster analysis of the data.

Choices have to be made in fine-tuning this approach. These choices will depend on the nature of the data and therefore cannot be completely specified in a general answer like this. Instead, let's analyze some data. I use R due to its popularity on this site and succinctness (even compared to Python).

First, create some data as described in the question:

set.seed(17) # For reproducible results
centers <- rnorm(100, mean=100, sd=20)
x <- c(centers + rnorm(100*100, mean=0, sd=1), 
       rnorm(100, mean=250, sd=1), 
       rnorm(9, mean=300, sd=1))

These commands specify 102 components: 100 of them are situated like 100 independent draws from a normal(100, 20) distribution (and will therefore tend to lie between 50 and 150); one of them is centered at 250, and one is centered at 300. They then draw 100 values independently from each component (using a common standard deviation of 1), except for the last component, centered at 300, from which only 9 values are drawn. According to the characterization of outliers, the 100 values centered at 250 do not constitute outliers: they should be viewed as a component of the mixture, albeit situated far from the others. However, the cluster of nine high values consists entirely of outliers. We need to detect these but no others.

Most omnibus univariate outlier-detection procedures would either not detect any of these 109 highest values or would indicate all 109 are outliers.
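As one illustration of that claim, here is a sketch using Tukey's boxplot rule (flag anything above the upper quartile plus 1.5 times the interquartile range). The choice of this particular rule, and the cut points 200 and 275 used to separate the two high clusters, are my assumptions for the demonstration. The data are regenerated with the same seed so that the block runs on its own.

```r
set.seed(17) # Same seed as above, so the data are identical
centers <- rnorm(100, mean = 100, sd = 20)
x <- c(centers + rnorm(100 * 100, mean = 0, sd = 1),
       rnorm(100, mean = 250, sd = 1),
       rnorm(9, mean = 300, sd = 1))

# Tukey's rule: flag values above Q3 + 1.5 * IQR.
q <- unname(quantile(x, c(0.25, 0.75)))
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
flagged_high <- x > upper_fence

# How many values of each high cluster are flagged?
n_near_250 <- sum(flagged_high & x > 200 & x <= 275)
n_near_300 <- sum(flagged_high & x > 275)
print(c(near_250 = n_near_250, near_300 = n_near_300))
```

The rule flags the whole 100-value component near 250 together with the nine values near 300, so it cannot tell the legitimate component from the true outliers.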

Suppose we have a good sense of the standard deviations of the components (obtained from prior information or from exploring the data). Use this to construct a kernel density estimate of the mixture:

d <- density(x, bw=1, n=1000)
plot(d, main="Kernel density")

[Figure: kernel density estimate (KDE)]

The (almost invisible) blip at the extreme right qualifies as a set of outliers: its small area (less than 10/10109 ≈ 0.001 of the total) indicates it consists of just a few values, and its situation at one extreme of the x-axis earns it the appellation of "outlier" rather than "inlier." Checking these things is straightforward:

x0 <- d$x[d$y > 1000/length(x) * dnorm(5)]
gaps <- tail(x0, -1) - head(x0, -1)
hist(gaps, main="Gap Counts") # base R; histogram() would need the lattice package

[Figure: gap histogram]

The density estimate d is represented by a 1D grid of 1000 bins. These commands have retained all bins in which the density is sufficiently large. For "large" I chose a very small value, to make sure that even the density of a single isolated value is picked up, but not so small that obviously separated components are merged.

Evidently the gap distribution has two high outliers (which can automatically be detected using any simple procedure, even an ad hoc one). One characterization is that they both exceed 25 (in this example). Let's find the values associated with them:

large.gaps <- gaps > 25
ranges <- rbind(tail(x0,-1)[large.gaps], c(tail(head(x0,-1)[large.gaps], -1), max(x)))

The output is

         [,1]     [,2]
[1,] 243.9937 295.7732
[2,] 256.3758 300.9340

Within the range of data (from 25 to 301) these gaps determine two potential outlying ranges, one from 244 to 256 (column 1) and another from 296 to 301 (column 2). Let's see how many values lie within these ranges:

lapply(apply(ranges, 2, function(r){x[r[1] <= x & x <= r[2]]}), length)

The result is

[[1]]
[1] 100

[[2]]
[1] 9

The 100 is too large to be unusual: that's one of the components of the mixture. But the 9 is small enough. It remains to see whether any of these components might be considered outlying (as opposed to inlying):

apply(ranges, 2, mean)

The result is

[1] 250.1848 298.3536

The center of the 100-point cluster is at 250 and the center of the 9-point cluster is at 298, far enough from the rest of the data to constitute a cluster of outliers. We conclude there are nine outliers. Specifically, these are the values determined by column 2 of ranges,

x[ranges[1,2] <= x & x <= ranges[2,2]]

In order, they are

299.0379 300.0376 300.2696 300.3892 300.4250 300.5659 300.7018 300.8436 300.9340

Solved – Calculating mean of continuous time series

Rather than being right or wrong, here Matlab assumes that your start and end samples also extend to their left and right by equal amounts. So, according to Matlab, your signal starts from t=-0.5 and ends at t=6. This is a zeroth-order interpolation technique, in which every sample extends to its left and right by equal amounts.