Solved – Correcting for outliers in a running average

cooks-distance, moving average, outliers

We have a daemon that reads in data from some sensors, and among the things it calculates (besides simply reporting the state) is the average time it takes for the sensors to change from one value to another. It keeps a running average of 64 datapoints, and assumes that runtime is fairly constant.
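
To make the setup concrete, a running average like this is typically kept over a small ring buffer. The sketch below only illustrates that idea; the names and buffer layout are my own assumptions, not the daemon's actual code.

    /* Sketch: 64-point running average over a ring buffer. */
    #include <stddef.h>

    #define WINDOW 64

    static double samples[WINDOW];
    static size_t head = 0, filled = 0;
    static double sum = 0.0;

    /* Push one new transition time and return the current running average. */
    double running_average(double value)
    {
        if (filled == WINDOW)
            sum -= samples[head];          /* drop the oldest sample */
        else
            filled++;

        samples[head] = value;
        sum += value;
        head = (head + 1) % WINDOW;

        return sum / (double)filled;
    }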

Unfortunately, as demonstrated by the below graph, the input data isn't the most pristine:

(Each line represents a different set of data; the x-axis doesn't really mean anything besides a vague historical time axis).

My obvious solution for dealing with this would be to create a histogram of the data and then pick the mode. However, I was wondering if there are other methods that would yield better performance or would be better suited to operation with a running average. Some quick Wikipedia searches suggest that algorithms for detecting outliers may also be suitable. Simplicity is a plus, since the daemon is written in C.
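
For what it's worth, the histogram-and-mode idea stays quite small in C. This is only a rough sketch; the bucket width and range are placeholders, not values from our system.

    /* Sketch: bucket the runtimes and return the centre of the fullest bucket. */
    #include <stddef.h>

    #define N_BUCKETS    64
    #define BUCKET_WIDTH 30.0   /* seconds per bucket (assumed) */
    #define RANGE_MIN    0.0    /* assumed lower edge of the histogram */

    double histogram_mode(const double *data, size_t n)
    {
        unsigned counts[N_BUCKETS] = {0};
        size_t i, best = 0;

        for (i = 0; i < n; i++) {
            double off = (data[i] - RANGE_MIN) / BUCKET_WIDTH;
            size_t b = (off < 0.0) ? 0 : (size_t)off;
            if (b >= N_BUCKETS)
                b = N_BUCKETS - 1;
            counts[b]++;
        }
        for (i = 1; i < N_BUCKETS; i++)
            if (counts[i] > counts[best])
                best = i;

        /* Midpoint of the most populated bucket. */
        return RANGE_MIN + (best + 0.5) * BUCKET_WIDTH;
    }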

Edit: I scoped out Wikipedia and came up with these various techniques:

  • Chauvenet's criterion: using the mean and standard deviation, calculate the probability that a particular datapoint would occur, and exclude it if the probability that it is actually that extreme is less than 50%. While this seems well suited for correcting a running average on the fly, I'm not quite convinced of its efficacy: it seems that with large data-sets it would be reluctant to discard datapoints. (A rough C sketch of this one appears after this list.)

  • Grubbs' test: Another method that compares a point's difference from the mean to the standard deviation, and gives a critical value for when the hypothesis of "no outliers" is rejected

  • Cook's distance: Measures the influence a datapoint has on a least-squares regression; our application would probably reject a datapoint whose distance exceeded 1

  • Truncated mean: Discard the low end and the high end, and then take the mean as normal
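
As promised above, here is a rough C sketch of the Chauvenet check, assuming the window's mean, standard deviation, and sample count are tracked somewhere else; the function name is just illustrative.

    /* Sketch: Chauvenet's criterion for a single candidate datapoint. */
    #include <math.h>
    #include <stddef.h>

    /* Returns 1 if x should be rejected, given the window's mean,
     * standard deviation and sample count n. */
    static int chauvenet_reject(double x, double mean, double stddev, size_t n)
    {
        if (stddev <= 0.0 || n < 2)
            return 0;                         /* not enough data to judge */

        double z = fabs(x - mean) / stddev;
        /* Two-sided tail probability of a deviation at least this large
         * under a normal distribution. */
        double p = erfc(z / sqrt(2.0));
        /* Reject if fewer than half a datapoint this extreme is expected. */
        return (double)n * p < 0.5;
    }

For a 64-point window the rejection threshold works out to roughly 2.7 standard deviations, so errors as large as the ones in the graph would still be discarded easily.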

Anyone have any specific experience and can comment on these statistical techniques?

Also, some comments about the physical situation: we're measuring the average time until completion of a mechanical washing machine, so its runtime should be fairly constant. I'm not sure if it actually has a normal distribution.

Edit 2: Another interesting question: when the daemon is bootstrapping, i.e. when it doesn't have any previous data to analyze, how should it deal with incoming data? Simply not do any outlier pruning?

Edit 3: One more thing… if the hardware does change such that the runtimes become different, is it worth making the algorithm robust enough that it won't discard these new runtimes, or should I just remember to flush the cache when that happens?

Best Answer

If that example graph you have is typical, then any of the criteria you list will work. Most of those statistical methods are for riding the edge of errors right at the fuzzy level of "is this really an error?" But your problem looks wildly simple: your errors are not just a couple of standard deviations from the norm, they're 20+. This is good news for you.

So, use the simplest heuristic. Always accept the first 5 points or so in order to prevent a startup spike from ruining your computation. Maintain mean and standard deviation. If your data point falls 5 standard deviations outside the norm, then discard it and repeat the previous data point as a filler.
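
A minimal sketch of that heuristic, assuming the filtered value is then fed into the existing 64-point average; the names and the Welford-style update are my additions, not a prescribed implementation.

    /* Sketch: accept the first few points, then reject anything more than
     * SIGMA_LIMIT standard deviations from the running mean. */
    #include <math.h>

    #define WARMUP      5       /* always accept the first few points */
    #define SIGMA_LIMIT 5.0     /* rejection threshold in standard deviations */

    static double mean = 0.0, m2 = 0.0;   /* Welford running mean / sum of squares */
    static unsigned long count = 0;
    static double last_accepted = 0.0;

    /* Feed one raw sample; returns the value to put into the running
     * average (the sample itself, or the previous point as a filler). */
    double filter_sample(double x)
    {
        double stddev = (count > 1) ? sqrt(m2 / (count - 1)) : 0.0;

        if (count >= WARMUP && stddev > 0.0 &&
            fabs(x - mean) > SIGMA_LIMIT * stddev) {
            return last_accepted;             /* outlier: repeat previous point */
        }

        /* Welford's online update of the mean and variance. */
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);

        last_accepted = x;
        return x;
    }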

If you know your typical data behavior in advance you may not even need to compute mean and standard deviation, you can hardwire absolute "reject" limits. This is actually better in that an initial error won't blow up your detector.
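
If you go that route, the check collapses to a pair of constants; the bounds below are placeholders for whatever the machine's plausible cycle times actually are.

    /* Sketch: hardwired absolute reject limits (values are placeholders). */
    #define MIN_RUNTIME_S 1500.0   /* assumed shortest plausible cycle, seconds */
    #define MAX_RUNTIME_S 4500.0   /* assumed longest plausible cycle, seconds */

    static int runtime_plausible(double runtime_s)
    {
        return runtime_s >= MIN_RUNTIME_S && runtime_s <= MAX_RUNTIME_S;
    }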
