Solved – Simulating an overdispersed negative binomial distribution

negative-binomial-distributionoverdispersion

I have RNAseq data in the form of normalized counts. The counts themselves follow an over-dispersed negative binomial, and I would like to generate random data which replicate these distributions. For example, I've subsampled a set of the counts ~4200 n, and the descriptives are as follows,

            Statistic    Std. Error
Mean        46.28        32.6
Lower       -17.64  
Upper       110.19  
Median      0   
Variance    4493154.99  
STD         2119.71 
Minimum     0   
Maximum     136974  
Skewness    63.86        0.04
Kurtosis    4123.38      0.08

I've tried the following:

data <- rnbinom(4200,46.28,0.9)

It seems to give a distribution similar to my data, but cuts off the upper bound range. So I'm not exactly sure if this is a robust simulation.

I do apologize for what is likely a rudimentary question. I'm not exactly a skilled statistician, so it's somewhat difficult to follow some of the documentation on these processes. Would anyone be kind enough to explain this to me in the way a novice would understand?

Best Answer

Since you have the mean and the variance, the easiest way to do this would be to use the parameterization via the mu and size parameters of rnbinom. Quoting from the help page:

An alternative parametrization (often used in ecology) is by the mean ‘mu’ (see above), and ‘size’, the dispersion parameter, where ‘prob’ = size/(size+mu)’. The variance is ‘mu + mu^2/size’ in this parametrization.

So we can use your mean and variance estimates and solve for size:

size=46.28^2/(4493154.99-46.28))

We can simulate like this:

rnbinom(4200,mu=46.28,size=46.28^2/(4493154.99-46.28))

However...

set.seed(1)
table(rnbinom(4200,mu=46.28,size=46.28^2/(4493154.99-46.28)))

    0     1     3     4     7     8     9    10    13    18    19    24    52 
 4174     1     1     1     1     1     1     1     1     1     1     1     1 
   68    69    72   115   186   197  1261  3654  3745  4212  5790 10304 14269 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
25467 
    1

Note especially the right tail. If you simulate this repeatedly, the maximum value and the entire right tail will fluctuate wildly. (I just sampled this five times and got maxima between 14,294 and 186,670.) This is a consequence of (a) enormous overdispersion (your SD is 45 times the mean!), and (b) a large sample size. And of course your estimate of the SD and the overdispersion is largely determined by the right tail of your observations.

Whatever you plan on doing with your data, I'd take a good hard look at it. Maybe perform some sensitivity analysis. What would your result look like if your maximum had been "only" 50,000? And so on.

It may be that whatever conclusions you draw in the course of your analysis are mainly driven by the largest ten of your observations, 0.2% of your sample of size 4200. That is not a very strong position to argue from.