I don't know the name of this distribution, but you can derive it from the law of total probability. Suppose $X$ and $Y$ are independent with negative binomial distributions having parameters $(r_{1}, p_{1})$ and $(r_{2}, p_{2})$, respectively. I'm using the parameterization where $X$ and $Y$ represent the number of successes before the $r_{1}$'th and $r_{2}$'th failures, respectively. Then,
$$ P(X - Y = k) = E_{Y} \Big( P(X-Y = k \mid Y) \Big) = E_{Y} \Big( P(X = k+Y \mid Y) \Big) =
\sum_{y=0}^{\infty} P(Y=y)P(X = k+y) $$
where the last step uses the independence of $X$ and $Y$.
We know
$$ P(X = k + y) = {k+y+r_{1}-1 \choose k+y} (1-p_{1})^{r_{1}} p_{1}^{k+y} $$
and
$$ P(Y = y) = {y+r_{2}-1 \choose y} (1-p_{2})^{r_{2}} p_{2}^{y} $$
so
$$ P(X-Y=k) = \sum_{y=0}^{\infty} {y+r_{2}-1 \choose y} (1-p_{2})^{r_{2}} p_{2}^{y} \cdot
{k+y+r_{1}-1 \choose k+y} (1-p_{1})^{r_{1}} p_{1}^{k+y} $$
That's not pretty (yikes!). The only simplification I see right off is
$$ p_{1}^{k} (1-p_{1})^{r_{1}} (1-p_{2})^{r_{2}}
\sum_{y=0}^{\infty} (p_{1}p_{2})^{y} {y+r_{2}-1 \choose y}
{k+y+r_{1}-1 \choose k+y} $$
which is still pretty ugly. I'm not sure if this is helpful, but this can also be rewritten as
$$ \frac{ p_{1}^{k} (1-p_{1})^{r_{1}} (1-p_{2})^{r_{2}} }{ (r_{1}-1)! (r_{2}-1)! }
\sum_{y=0}^{\infty}
(p_{1}p_{2})^{y}
\frac{ (y+r_{2}-1)! (k+y+r_{1}-1)! }{y! (k+y)! } $$
I'm not sure if there is a simplified expression for this sum, but it can be approximated numerically if you only need it to calculate $p$-values.
I verified with simulation that the above calculation is correct. Here is a crude R function to calculate this mass function, followed by a few simulations:
f = function(k, r1, r2, p1, p2, UB)
{
  # P(X - Y = k) for independent negative binomials, computed from the
  # factorial form of the sum above, truncated at y = UB
  S = 0
  const = (p1^k) * ((1-p1)^r1) * ((1-p2)^r2)
  const = const / ( factorial(r1-1) * factorial(r2-1) )
  for (y in 0:UB)
  {
    iy = ((p1*p2)^y) * factorial(y+r2-1) * factorial(k+y+r1-1)
    iy = iy / ( factorial(y) * factorial(y+k) )
    S = S + iy
  }
  return(S*const)
}
### Sims
r1 = 6; r2 = 4;
p1 = .7; p2 = .53;
X = rnbinom(1e5,r1,p1)
Y = rnbinom(1e5,r2,p2)
mean( (X-Y) == 2 )
[1] 0.08508
f(2,r1,r2,1-p1,1-p2,20)
[1] 0.08509068
mean( (X-Y) == 1 )
[1] 0.11581
f(1,r1,r2,1-p1,1-p2,20)
[1] 0.1162279
mean( (X-Y) == 0 )
[1] 0.13888
f(0,r1,r2,1-p1,1-p2,20)
[1] 0.1363209
I've found the sum converges very quickly for all of the values I tried, so setting UB higher than 10 or so is not necessary. Note that R's built-in rnbinom function parameterizes the negative binomial in terms of the number of failures before the $r$'th success, in which case you'd need to replace all of the $p_{1}, p_{2}$'s in the above formulas with $1-p_{1}, 1-p_{2}$ for compatibility.
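Incidentally, each term of that sum is a product of two negative binomial probabilities, and R's dnbinom(x, size, prob) equals choose(x+size-1, x) * prob^size * (1-prob)^x. If I have the correspondence right, the whole mass function then collapses to a one-line vectorized sum, which also sidesteps the factorials (and their overflow for large arguments) entirely:
f2 = function(k, r1, r2, p1, p2, UB)
{
  # p1, p2 use the same convention as f above (success probabilities in
  # the successes-before-failures parameterization), so R's prob argument
  # is the complement
  y = 0:UB
  sum( dnbinom(y, r2, 1 - p2) * dnbinom(k + y, r1, 1 - p1) )
}
f2(2, r1, r2, 1-p1, 1-p2, 20)   # should agree with f(2, r1, r2, 1-p1, 1-p2, 20)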
Methods of fitting discrete distributions
There are three main methods* used to fit (estimate the parameters of) discrete distributions.
1. Maximum likelihood (ML). This finds the parameter values that give the best chance of supplying your sample (given the other assumptions, like independence, constant parameters, etc.).
2. Method of moments (MoM). This finds the parameter values that make the first few population moments match your sample moments. It’s often fairly easy to do, and in many cases yields fairly reasonable estimators. It’s also sometimes used to supply starting values to ML routines; a short sketch for the negative binomial appears below.
3. Minimum chi-square. This minimizes the chi-square goodness-of-fit statistic over the discrete distribution, though sometimes with larger data sets the end categories might be combined for convenience. It often works fairly well, and arguably it even has some advantages over ML in particular situations, but generally it must be iterated to convergence, in which case most people tend to prefer ML.
The first two methods are also used for continuous distributions; the third is usually not used in that case.
These by no means comprise an exhaustive list; it would be quite possible to estimate parameters by minimizing the KS statistic, for example, and even (if you adjust for the discreteness) to get a joint consonance region from it, if you were so inclined.
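To illustrate method 2, here is a minimal method-of-moments sketch for the negative binomial in the (size, mu) parameterization that MASS uses, where the mean is mu and the variance is mu + mu^2/size; equating these to the sample mean and variance gives the estimators below (the function name momnb is mine, not from any package):
momnb <- function(x) {
  # match the first two moments: mean = mu, variance = mu + mu^2/size
  m <- mean(x); v <- var(x)
  c(size = m^2 / (v - m), mu = m)   # requires v > m (overdispersion)
}
momnb(rnbinom(500, size = 5, mu = 3))   # should land near c(5, 3)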
Since you’re working in R, ML estimation is quite easy to achieve for the negative binomial. If your sample were in x, it’s as simple as library(MASS); fitdistr(x, "negative binomial"):
> library(MASS)
> x <- rnegbin(100, 7, 3)
> fitdistr(x, "negative binomial")
      size          mu
  3.6200839    6.3701156
 (0.8033929)  (0.4192836)
Those are the parameter estimates and their (asymptotic) standard errors.
In the case of the Poisson distribution, MLE and MoM both estimate the Poisson parameter at the sample mean.
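For instance, a quick check of that fact (the rate 4 below is just an arbitrary value for the demonstration):
xp <- rpois(100, 4)
mean(xp)                  # method-of-moments estimate: the sample mean
fitdistr(xp, "Poisson")   # ML estimate: the same value, plus a standard error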
If you'd like to see examples, you should post some actual counts. Note that your histogram has been done with bins chosen so that the 0 and 1 categories are combined and we don't have the raw counts.
As near as I can guess, your data are roughly as follows:
Count:       0&1    2    3    4    5    6   >6
Frequency:   311  197   74   15    3    1    0
But the big numbers will be uncertain: they depend heavily on how accurately the low counts are represented by the pixel heights of their bars, and the true frequencies could be some multiple of those numbers, like twice as large. The raw counts affect the standard errors, so it matters whether they're about those values or twice as big.
Combining the first two groups makes the fit a little awkward (it's possible to do, but less straightforward when categories are merged, as sketched below). A lot of information is in those first two groups, so it's best not to let the default histogram lump them.
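For what it's worth, a negative binomial can still be fitted to binned counts with the 0 and 1 cells combined, by maximizing the multinomial likelihood of the cell probabilities. Here is a rough sketch using my guessed frequencies above; the (size, mu) parameterization and the starting values are just choices for illustration:
obs <- c(311, 197, 74, 15, 3, 1, 0)   # cells: 0&1, 2, 3, 4, 5, 6, >6
negll <- function(par) {
  size <- exp(par[1]); mu <- exp(par[2])   # log scale keeps both positive
  cells <- c(dnbinom(0, size, mu = mu) + dnbinom(1, size, mu = mu),
             dnbinom(2:6, size, mu = mu),
             pnbinom(6, size, mu = mu, lower.tail = FALSE))
  -sum(obs * log(cells))   # multinomial log-likelihood (up to a constant)
}
fit <- optim(c(0, 0), negll)
exp(fit$par)   # estimates of (size, mu)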
* Other methods of fitting discrete distributions are possible of course (one might match quantiles or minimise other goodness of fit statistics for example). The ones I mention appear to be the most common.
Best Answer
If there are enough counts, one can use the Central Limit Theorem, which gives a normal approximation to the negative binomial; in particular, see the answer to that question. Otherwise, one uses the negative binomial itself. One can test both distributions for how well they fit the data to see which applies best.
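As a rough illustration of the first point, one can check how close the negative binomial CDF is to a normal CDF with matching mean and variance; in R's parameterization the mean is size*(1-prob)/prob and the variance is size*(1-prob)/prob^2 (the parameter values below are arbitrary):
size <- 50; prob <- 0.4
m <- size * (1 - prob) / prob          # mean
s <- sqrt(size * (1 - prob) / prob^2)  # standard deviation
q <- 0:300
max(abs(pnbinom(q, size, prob) - pnorm(q + 0.5, m, s)))   # continuity-corrected
The discrepancy shrinks as size grows, which is the Central Limit Theorem at work: the negative binomial is a sum of size independent geometric variables.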