Solved – Detecting outliers in count data

count-data, fitting, outliers

I have what I naively thought to be a fairly straightforward problem involving outlier detection for many different sets of count data. Specifically, I want to determine whether one or more values in a series of counts is higher or lower than expected relative to the rest of the counts in the distribution.

The complicating factor is that I need to do this for 3,500 distributions, and it is likely that some of them will best fit a zero-inflated, overdispersed Poisson, others a negative binomial or ZINB, and still others may be roughly normally distributed. For this reason, simple z-scores or plotting of the distribution are not appropriate for much of the dataset. Here is an example of the count data for which I want to detect outliers.

counts1=[1 1 1 0 2 1 1 0 0 1 1 1 1 1 0 0 0 0 1 2 1 1 2 1 1 1 1 0 0 1 0 1 1 1 1 0 
         0 0 0 0 1 2 1 1 1 1 1 1 0 1 1 2 0 0 0 1 0 1 2 1 1 0 2 1 1 1 0 0 1 0 0 0 
         2 0 1 1 0 2 1 0 1 1 0 0 2 1 0 1 1 1 1 2 0 3]
counts2=[0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 
         0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
         0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 
         1 1 0 0 0]
counts3=[14 13 14 14 14 14 13 14 14 14 14 14 15 14 14 14 14 14 14 15 14 13 14 14 
         15 12 13 17 13 14 14 14 14 15 14 14 13 14 13 14 14 14 14 13 14 14 14 15 
         15 14 14 14 14 14 15 14 14 14 14 15 14 14 14 14 14 14 14 14 14 14 14 14 13 16]
counts4=[0 3 1.......]
and so on up to counts3500.

Initially I thought I would need to write a loop in Python or R that would fit a set of candidate models to each distribution and select the best-fitting model according to AIC or some other criterion (maybe using the fitdistrplus package in R?). I could then ask which counts are extreme for the given distribution (the counts that fall in the tails, e.g. would a count of 4 be an outlier in the counts1 distribution above?). However, I am not sure this is a valid strategy, and it occurred to me that there may be a simple methodology for detecting outliers in count data of which I was not aware. I have searched extensively and found nothing that seems appropriate for my problem, given the number of distributions I want to look at.
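For reference, here is roughly the kind of loop I had in mind, sketched in R with the fitdistrplus package (untested; the name all_counts is just a placeholder for however the 3,500 vectors end up stored, and zero-inflated models would need custom density functions passed to fitdist, which I have not worked out):

library(fitdistrplus)

# all_counts: placeholder list of integer vectors, one per distribution
# (all_counts[[1]] would hold counts1, ..., all_counts[[3500]] counts3500)
best_fits <- lapply(all_counts, function(x) {
  fits <- list(pois   = fitdist(x, "pois"),
               nbinom = fitdist(x, "nbinom"))
  # keep the candidate with the lowest AIC for this distribution
  # (nbinom fits may fail to converge for nearly constant vectors like counts2)
  fits[[which.min(sapply(fits, function(f) f$aic))]]
})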

My ultimate goal is to detect significant increases or decreases in a count for each distribution of counts, using the most statistically appropriate methodology.

Best Answer

You cannot reliably detect outliers using the distance of an observation from a classical fit of your data, because the fitting procedure is itself liable to be pulled towards the outliers (this is called the masking effect). One simple way to detect outliers reliably is to use the general idea you suggested (distance from a fit) but to replace the classical estimators with robust ones that are much less susceptible to being swayed by outliers. Below I give a general illustration of the idea and then discuss the solution for your specific problem.

An illustration: consider the following 20 observations drawn from a $\mathcal{N}(0,1)$ (rounded to the second digit):

x<-c(-2.21,-1.84,-.95,-.91,-.36,-.19,-.11,-.1,.18,
.3,.31,.43,.51,.64,.67,.72,1.22,1.35,8.1,17.6)

(the last two really ought to be .81 and 1.76 but have been accidentally mistyped).

Using an outlier detection rule based on comparing the statistic

$$\frac{|x_i-\text{ave}(x_i)|}{\text{sd}(x_i)}$$

to the quantiles of a normal distribution would never lead you to suspect that 8.1 is an outlier, so you would end up estimating the $\text{sd}$ of the resulting 'trimmed' series to be about 2 (for comparison, the raw, i.e. untrimmed, estimate of the $\text{sd}$ is 4.35).
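In R, using the x defined above, that classical rule amounts to (a small sketch; a cutoff of 3 is used purely for illustration):

# classical z-scores: mean() and sd() are both pulled towards the outliers
z_classical <- abs(x - mean(x)) / sd(x)
round(z_classical, 2)
# 8.1 scores only about 1.6, nowhere near a cutoff such as 3, so it is
# never flagged; only 17.6 (about 3.8) stands out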

Had you used a robust statistic instead:

$$\frac{|x_i-\text{med}(x_i)|}{\text{mad}(x_i)}$$

and compared the resulting robust $z$-scores to the quantiles of a normal distribution, you would have correctly flagged the last two observations as outliers (and correctly estimated the $\text{sd}$ of the trimmed series to be 0.96).
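The robust version, again as a sketch in R (mad() in R is by default scaled to be consistent with the sd under normality):

# robust z-scores: median() and mad() are barely affected by the two outliers
z_robust <- abs(x - median(x)) / mad(x)
round(z_robust, 2)
# 8.1 and 17.6 now score roughly 11.6 and 25.6, far above every other
# observation (the next largest score is about 3.7)
sd(x[-c(19, 20)])  # sd with the two outliers trimmed: about 0.96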

(In the interest of completeness I should point out that some people, even in this day and age, prefer to cling to the raw --untrimmed-- estimate of 4.35 rather than use the more precise estimate based on trimming, but this is unintelligible to me.)

For other distributions the situation is not that different, except that you will first have to transform your data. For example, in your case:

Suppose $X$ is your original count data. One trick is to use the transformation:

$$Y=2\sqrt{X}$$

and to exclude an observation as an outlier if $Y>\text{med}(Y)+3$ (this rule is not symmetric, and I for one would be very cautious about excluding observations from the left 'tail' of a count variable according to a data-based threshold; negative observations, obviously, should be pretty safe to remove).
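A sketch of this rule in R (the function name is mine, and I assume each of your series is available as an integer vector):

# flag counts that are unusually high under the 2*sqrt(X) rule
flag_high <- function(x) {
  y <- 2 * sqrt(x)
  which(y > median(y) + 3)   # indices of the observations flagged as too high
}

Applied to counts1, the median count is 1, so $\text{med}(Y)=2$ and a count $x$ is flagged only when $2\sqrt{x}>5$, i.e. when $x>6.25$; in particular, the count of 4 asked about in the question would not be flagged.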

The rule is based on the idea that if $X$ is Poisson distributed, then

$$Y\approx \mathcal{N}(\text{med}(Y),1)$$

This approximation works reasonably well for Poisson-distributed data when $\lambda$ (the parameter of the Poisson distribution) is larger than 3.
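If you want to convince yourself of this, a quick simulation (with $\lambda=10$, an arbitrary value well above 3) shows that the transformed counts do behave roughly like a normal variable with unit standard deviation:

set.seed(1)                        # arbitrary seed
y_sim <- 2 * sqrt(rpois(1e5, lambda = 10))
c(mean = mean(y_sim), sd = sd(y_sim))
# the mean comes out roughly 2*sqrt(10) and the sd roughly 1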

When $\lambda$ is smaller than 3 (or when the model governing the distribution of the majority of the data has a mode closer to 0 than a Poisson with $\lambda=3$, e.g. ZINB random variables), the approximation tends to err on the conservative side (flagging fewer observations as outliers).

To see why this is considered 'conservative', consider that in the limit (when the data are binomial with very small $p$) no observation would ever be flagged as an outlier by this rule, and this is precisely the behaviour we want: to cause masking, outliers have to be able to drive the estimated parameters arbitrarily far away from their true values. When the data are drawn from a distribution with bounded support (such as the binomial), this simply cannot happen...