Solved – Detecting outliers in count data

count-data, fitting, outliers

I have what I naively thought to be a fairly straightforward problem involving outlier detection for many different sets of count data. Specifically, I want to determine whether one or more values in a series of counts is higher or lower than expected relative to the rest of the counts in the distribution.

The complicating factor is that I need to do this for 3,500 distributions, and it is likely that some of them will best fit a zero-inflated, overdispersed Poisson, others a negative binomial or ZINB, and still others may be roughly normally distributed. For this reason, simple z-scores or plotting of the distribution are not appropriate for much of the dataset. Here is an example of the count data for which I want to detect outliers.

counts1=[1 1 1 0 2 1 1 0 0 1 1 1 1 1 0 0 0 0 1 2 1 1 2 1 1 1 1 0 0 1 0 1 1 1 1 0 
         0 0 0 0 1 2 1 1 1 1 1 1 0 1 1 2 0 0 0 1 0 1 2 1 1 0 2 1 1 1 0 0 1 0 0 0 
         2 0 1 1 0 2 1 0 1 1 0 0 2 1 0 1 1 1 1 2 0 3]
counts2=[0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 
         0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
         0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 
         1 1 0 0 0]
counts3=[14 13 14 14 14 14 13 14 14 14 14 14 15 14 14 14 14 14 14 15 14 13 14 14 
         15 12 13 17 13 14 14 14 14 15 14 14 13 14 13 14 14 14 14 13 14 14 14 15 
         15 14 14 14 14 14 15 14 14 14 14 15 14 14 14 14 14 14 14 14 14 14 14 14 13 16]
counts4=[0 3 1.......]
and so on up to counts3500.

Initially I thought I would need to write a loop in Python or R that would fit a set of candidate models to each distribution and select the best-fitting model according to AIC or some other criterion (maybe using the fitdistrplus package in R?). I could then ask which counts are extreme for the given distribution (the counts that fall in the tails, e.g. would a count of 4 be an outlier in the counts1 distribution above?). However, I am not sure this is a valid strategy, and it occurred to me that there may be a simple methodology for detecting outliers in count data of which I was not aware. I have searched extensively and found nothing that seems appropriate for my problem, given the number of distributions I want to look at.
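For reference, here is roughly the kind of loop I had in mind, sketched in R with the fitdistrplus package (untested; the name all_counts is just a placeholder for however the 3,500 vectors end up stored, and zero-inflated models would need custom density functions passed to fitdist, which I have not worked out):

library(fitdistrplus)

# all_counts: placeholder list of integer vectors, one per distribution
# (all_counts[[1]] would hold counts1, ..., all_counts[[3500]] counts3500)
best_fits <- lapply(all_counts, function(x) {
  fits <- list(pois   = fitdist(x, "pois"),
               nbinom = fitdist(x, "nbinom"))
  # keep the candidate with the lowest AIC for this distribution
  # (nbinom fits may fail to converge for nearly constant vectors like counts2)
  fits[[which.min(sapply(fits, function(f) f$aic))]]
})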

My ultimate goal is to detect significant increases or decreases in a count for each distribution of counts, using the most statistically appropriate methodology.

Best Answer

You cannot reliably detect outliers using the distance of an observation from a classical fit of your data, because the fitting procedure is itself liable to be pulled towards the outliers (this is called the masking effect). One simple way to detect outliers reliably is to use the general idea you suggested (distance from a fit) but to replace the classical estimators with robust ones that are much less susceptible to being swayed by outliers. Below I give a general illustration of the idea and then discuss the solution for your specific problem.

An illustration: consider the following 20 observations drawn from a $\mathcal{N}(0,1)$ (rounded to the second digit):

x<-c(-2.21,-1.84,-.95,-.91,-.36,-.19,-.11,-.1,.18,
.3,.31,.43,.51,.64,.67,.72,1.22,1.35,8.1,17.6)

(the last two really ought to be .81 and 1.76 but have been accidentally mistyped).

Using an outlier detection rule based on comparing the statistic

$$\frac{|x_i-\text{ave}(x_i)|}{\text{sd}(x_i)}$$

to the quantiles of a normal distribution would never lead you to suspect that 8.1 is an outlier, so you would end up estimating the $\text{sd}$ of the resulting 'trimmed' series to be about 2 (for comparison, the raw, i.e. untrimmed, estimate of the $\text{sd}$ is 4.35).
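In R, using the x defined above, that classical rule amounts to (a small sketch; a cutoff of 3 is used purely for illustration):

# classical z-scores: mean() and sd() are both pulled towards the outliers
z_classical <- abs(x - mean(x)) / sd(x)
round(z_classical, 2)
# 8.1 scores only about 1.6, nowhere near a cutoff such as 3, so it is
# never flagged; only 17.6 (about 3.8) stands out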

Had you used a robust statistic instead:

$$\frac{|x_i-\text{med}(x_i)|}{\text{mad}(x_i)}$$

and compared the resulting robust $z$-scores to the quantiles of a normal distribution, you would have correctly flagged the last two observations as outliers (and correctly estimated the $\text{sd}$ of the trimmed series to be 0.96).
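The robust version, again as a sketch in R (mad() in R is by default scaled to be consistent with the sd under normality):

# robust z-scores: median() and mad() are barely affected by the two outliers
z_robust <- abs(x - median(x)) / mad(x)
round(z_robust, 2)
# 8.1 and 17.6 now score roughly 11.6 and 25.6, far above every other
# observation (the next largest score is about 3.7)
sd(x[-c(19, 20)])  # sd with the two outliers trimmed: about 0.96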

(In the interest of completeness I should point out that some people, even in this day and age, prefer to cling to the raw --untrimmed-- estimate of 4.35 rather than use the more precise estimate based on trimming, but this is unintelligible to me.)

For other distributions the situation is not that different, except that you will first have to transform your data. For example, in your case:

Suppose $X$ is your original count data. One trick is to use the transformation:

$$Y=2\sqrt{X}$$

and to exclude an observation as an outlier if $Y>\text{med}(Y)+3$ (this rule is not symmetric, and I for one would be very cautious about excluding observations from the left 'tail' of a count variable according to a data-based threshold; negative observations, obviously, should be pretty safe to remove).
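A sketch of this rule in R (the function name is mine, and I assume each of your series is available as an integer vector):

# flag counts that are unusually high under the 2*sqrt(X) rule
flag_high <- function(x) {
  y <- 2 * sqrt(x)
  which(y > median(y) + 3)   # indices of the observations flagged as too high
}

Applied to counts1, the median count is 1, so $\text{med}(Y)=2$ and a count $x$ is flagged only when $2\sqrt{x}>5$, i.e. when $x>6.25$; in particular, the count of 4 asked about in the question would not be flagged.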

The rule is based on the idea that if $X$ is Poisson distributed, then

$$Y\approx \mathcal{N}(\text{med}(Y),1)$$

This approximation works reasonably well for Poisson-distributed data when $\lambda$ (the parameter of the Poisson distribution) is larger than 3.
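If you want to convince yourself of this, a quick simulation (with $\lambda=10$, an arbitrary value well above 3) shows that the transformed counts do behave roughly like a normal variable with unit standard deviation:

set.seed(1)                        # arbitrary seed
y_sim <- 2 * sqrt(rpois(1e5, lambda = 10))
c(mean = mean(y_sim), sd = sd(y_sim))
# the mean comes out roughly 2*sqrt(10) and the sd roughly 1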

When $\lambda$ is smaller than 3 (or when the model governing the distribution of the majority of the data has a mode closer to 0 than a Poisson with $\lambda=3$, e.g. ZINB random variables), the approximation tends to err on the conservative side (flagging fewer observations as outliers).

To see why this is considered 'conservative', consider that in the limit (when the data are binomial with very small $p$) no observation would ever be flagged as an outlier by this rule, and this is precisely the behaviour we want: to cause masking, outliers have to be able to drive the estimated parameters arbitrarily far away from their true values. When the data are drawn from a distribution with bounded support (such as the binomial), this simply cannot happen...