Solved – Comparing two vectors from negative binomial distribution in R

goodness of fitnegative-binomial-distributionr

I'm using R and have two vectors of discrete values. They are not strictly speaking categorical because the values themselves are number of dots counted on the image of a cell (whole vector is all the cells on the image). There are two vectors: reference and a vector with dot counts after some perturbation

What I believe is that such data should follow negative binomial distribution and some sort of goodness of fit should give a p-value and some statistic describing whether the two distributions differ significantly.

What people advised me is that chi square test would do the trick but in my understanding chi square considers all values only as a category and ignores the fact that these are numbers and if lets say number of cells with 5 dots decreased a bit while number of cells with 4 dots increased that's not the same as if same situation would happen with 0-dots and 6-dots categories.

However what I didn't find is a test which could deal with negative binomial distributions. I hope I described the problem clearly. So if somebody know any test which would deal with such kind of data or if anybody thinks that my assumptions are wrong you are welcome to share your ideas.

Example 1

library(ggplot2)
c.dots = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 3, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
           0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 2, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 2,
           0, 0, 0, 1, 0, 0, 1, 0, 0, 3, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0,
           0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 2, 0,
           0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0,
           0, 0, 1, 1, 0, 0, 1, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 0, 1, 1, 0, 0, 3, 0, 0,
           0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0)

w.dots = c(0, 0, 0, 1, 3, 1, 1, 1, 1, 0, 0, 2, 0, 0, 2, 1, 0, 1, 3, 0, 1, 0, 0, 0, 2,
           0, 2, 2, 0, 3, 1, 2, 1, 0, 2, 1, 0, 2, 0, 1, 2, 1, 0, 0, 1, 0, 1, 1, 0, 0,
           0, 1, 1, 0, 2, 0, 0, 1, 3, 0, 0, 1, 0, 2, 1, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1,
           2, 4, 1, 0, 0, 2, 2, 0, 1, 0, 1, 3, 0, 2, 1, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0,
           1, 1, 0, 1, 0, 0, 2, 0, 1, 0, 2, 1, 0, 1, 2, 0, 4, 2, 0, 1, 0, 2, 0, 1, 2,
           1, 1, 2, 1, 1, 3, 1, 0, 1, 0, 1, 2, 0, 1, 2, 0, 1, 1, 2, 2, 0, 3, 0, 1, 1,
           0, 0, 2, 0, 1, 1, 0, 1, 2, 0, 0, 1, 0, 1, 2, 0, 0, 4, 3, 0, 1, 0, 0, 1, 0,
           4, 0, 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 1, 0, 3, 1, 1, 0, 4, 1, 1, 3)

chisq.test(rbind(table(w.dots), table(c.dots)))

nbrand = rnbinom(length(c.dots), mu = 1, size = 1)
ggplot() + 
  geom_density(aes(x=x), data=data.frame(x=c.dots), fill="red", alpha=0.5) + 
  geom_density(aes(x=x), data=data.frame(x=w.dots), fill="blue", alpha=0.5) +
  geom_density(aes(x=x), data=data.frame(x=nbrand), colour="green", alpha=0, linetype=3)

Example 2

library(ggplot2)
c.dots = c(1, 0, 0, 1, 0, 0, 3, 0, 1, 0, 3, 0, 2, 0, 0, 2, 2, 0, 0, 1, 1, 0, 0, 1, 0,
           0, 0, 0, 1, 0, 1, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
           1, 1, 1, 2, 0, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 3, 4,         
           0, 1, 1, 0, 1, 0, 2, 1, 2, 2, 3, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1,
           1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 2, 0, 0, 3, 2, 2, 1, 0, 2, 0, 2, 2, 0, 0, 2,
           1, 0, 2, 0, 0, 2, 2, 1, 0, 0, 0, 0, 0, 1, 0, 3, 0, 1, 0, 0, 1, 0, 0, 0, 0,
           2, 1, 1, 0, 1, 0, 1, 1, 0, 1, 3, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 2, 1, 0,
           1, 2, 0, 0, 3, 3, 0, 1, 2, 0, 0, 1, 1, 0, 1, 1, 3, 1, 3, 0, 2, 0, 0, 0, 0)

w.dots = c(1, 3, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 3, 0, 0, 0, 1, 2, 0,
           1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 5, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 2,
           0, 1, 0, 3, 0, 0, 1, 2, 3, 1, 0, 0, 0, 2, 1, 1, 2, 0, 2, 0, 3, 0, 2, 0, 0,
           0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 1, 2, 3, 0, 2, 2, 1, 0, 1, 0, 0, 1, 0, 1, 0,
           0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 2, 0,
           0, 1, 2, 1, 1, 1, 2, 1, 2, 3, 2, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 1, 1,
           0, 0, 1, 2, 1, 0, 1, 2, 1, 1, 1, 0, 1, 0, 0, 0, 2, 1, 1, 1, 0, 0, 0, 0, 0,
           1, 2, 1, 2, 0, 1, 2, 1, 0, 1, 3, 2, 1, 0, 0, 0, 0, 0, 2, 1, 1, 2, 2, 1, 2)

chisq.test(rbind(table(w.dots), table(c.dots)))

nbrand = rnbinom(length(c.dots), mu = 1, size = 1)
ggplot() + 
  geom_density(aes(x=x), data=data.frame(x=c.dots), fill="red", alpha=0.5) + 
  geom_density(aes(x=x), data=data.frame(x=w.dots), fill="blue", alpha=0.5) +
  geom_density(aes(x=x), data=data.frame(x=nbrand), colour="green", alpha=0, linetype=3)

Best Answer

Your dependent variable is a count ("number of dots counted on the image of a cell"). Asking whether the distribution of counts is similar in two groups is conceptually the same as asking whether group membership matters for the distribution of counts.

I suggest a Poisson regression as a first step where you model the dot count with group membership. In a second step, one might then try to decide whether the Poisson assumption of "conditional variance = conditional mean" is violated, suggesting a move to a quasi-Poisson model, to a Poisson-model with heteroscedasticity-consistent (HC) standard error estimates, or to a negative binomial model.

Given data c.dots and w.dots as in the OP's example 1: We first create a data frame with predicted variable Y= number of dots and predictor X= factor with group membership. Then we run a standard Poisson regression

> dotsDf <- data.frame(Y=c(c.dots, w.dots),
+                      X=factor(rep(c("c", "w"), c(length(c.dots), length(w.dots)))))

> glmFitP <- glm(Y ~ X, family=poisson(link="log"), data=dotsDf)
> summary(glmFitP)                      # Poisson model
Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9289     0.1125  -8.256  < 2e-16 ***
Xw            0.8455     0.1345   6.286 3.26e-10 ***

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 509.01  on 399  degrees of freedom
Residual deviance: 465.90  on 398  degrees of freedom
AIC: 862.16

This indicates a significant predictor "group membership = w" resulting from dummy coding the grouping factor (2 groups => 1 dummy predictor, c is the reference level). For comparison, we can run the quasi-Poisson model that has an extra dispersion parameter for the conditional variance.

> glmFitQP <- glm(Y ~ X, family=quasipoisson(link="log"), data=dotsDf)
> summary(glmFitQP)                     # quasi-Poisson model
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.9289     0.1298  -7.158 3.96e-12 ***
Xw            0.8455     0.1551   5.450 8.85e-08 ***

(Dispersion parameter for quasipoisson family taken to be 1.330164)

    Null deviance: 509.01  on 399  degrees of freedom
Residual deviance: 465.90  on 398  degrees of freedom
AIC: NA

The parameter estimates are the same, but the standard errors of these estimates is slightly larger. The estimated dispersion parameter is slightly larger than 1 (the value in the Poisson-model), indicating some overdispersion. An alternative approach is to use a Poisson model with HC-consistent standard errors:

> library(sandwich)                     # for vcovHC()
> library(lmtest)                       # for coeftest()
> hcSE <- vcovHC(glmFitP, type="HC0")   # HC-consistent standard errors
> coeftest(glmFitP, vcov=hcSE)
z test of coefficients:

            Estimate Std. Error z value  Pr(>|z|)    
(Intercept) -0.92887    0.14084 -6.5952 4.246e-11 ***
Xw           0.84549    0.16033  5.2735 1.339e-07 ***

Again, somewhat larger standard errors. Now the negative binomial model:

> library(MASS)                         # for glm.nb()
> glmFitNB <- glm.nb(Y ~ X, data=dotsDf)
> summary(glmFitNB)                     # negative binomial model
Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9289     0.1192  -7.791 6.66e-15 ***
Xw            0.8455     0.1456   5.806 6.40e-09 ***

(Dispersion parameter for Negative Binomial(3.212) family taken to be 1)

    Null deviance: 427.19  on 399  degrees of freedom
Residual deviance: 391.20  on 398  degrees of freedom
AIC: 856.87

              Theta:  3.21 
          Std. Err.:  1.46 

 2 x log-likelihood:  -850.867 

You can test the negative binomial model against the Poisson model in a likelihood ratio test for the model comparison:

> library(pscl)                         # for odTest()
> odTest(glmFitNB)
Likelihood ratio test of H0: Poisson, as restricted NB model:
n.b., the distribution of the test-statistic under H0 is non-standard

Critical value of test statistic at the alpha= 0.05 level: 2.7055 
Chi-Square Test Statistic =  7.2978 p-value = 0.003452

The result here indicates that the data are unlikely to come from a Poisson model.

For the OP's example 2, all these tests are non-significant.

Note that I slightly shortened the output from glm() and glm.nb().