Solved – Fitting a negative binomial distribution to large count data

goodness-of-fit, negative-binomial-distribution, r, statistical-significance

I have ~1 million data points; here is the link to file data.txt. Each point takes an integer value between 0 and 145, so the dataset is discrete. Below is a histogram of the dataset: the x-axis shows the count (0-145) and the y-axis the density.

Source of data: I have around 20 reference objects and 1 million random objects in the space. For each of the 1 million random objects I calculated the Manhattan distance to each of the 20 reference objects, but kept only the shortest of those 20 distances. So I have 1 million Manhattan distances (which you can find in the file linked above).

I tried to fit Poisson and negative binomial distributions to this dataset using R. The fit from the negative binomial distribution seems reasonable. Below is the fitted curve (in blue).

Final goal: once I have fitted this distribution appropriately, I would like to treat it as the random (null) distribution of distances. The next time I calculate the distance (d) of any object to these 20 reference objects, I want to know whether (d) is significant or just part of the random distribution.

[Figure: histogram of the distances with the fitted negative binomial curve in blue]
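The final goal can be sketched as a one-sided tail-probability check against the fitted null distribution. This is only a sketch: the `size` and `mu` values are the estimates quoted in the question's code, and `d` and `alpha` are made-up illustrative numbers.

```r
# Score a new shortest distance d against the fitted null distribution.
# size and mu are the estimates from the fit in the question;
# d and alpha below are purely illustrative.
size <- 25.05688
mu   <- 31.56127

d <- 5                                    # a newly observed shortest distance

# probability of seeing a distance this small (or smaller) by chance
p_lower <- pnbinom(d, size = size, mu = mu)

# flag d as significantly closer than random at level alpha
alpha   <- 0.05
is_close <- p_lower < alpha
```

A lower-tail probability is used here because an unusually *small* distance is what would indicate a non-random association with the reference objects; flip to `lower.tail = FALSE` if unusually large distances are the interesting case.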

To evaluate the goodness of fit I ran a chi-squared test in R with the observed frequencies and the probabilities obtained from the negative binomial fit. Although the blue curve fits the distribution nicely, the p-value returned by the chi-squared test is extremely low.

This confused me a bit. I have two related questions:

  1. Is the choice of negative binomial distribution for this dataset appropriate?

  2. If the chi-squared test p-value is so low, should I consider another distribution?

Below is the complete code I used:

# read the file containing count data
data <- read.csv("data.txt", header=FALSE)

# plot the histogram
hist(data[[1]], prob=TRUE, breaks=145)

# load library
library(fitdistrplus)

# fit the negative binomial distribution
fit <- fitdist(data[[1]], "nbinom")

# get the fitted densities using the estimated parameters
# (here size = 25.05688, mu = 31.56127)
fitD <- dnbinom(0:145, size=fit$estimate[["size"]], mu=fit$estimate[["mu"]])

# add fitted line (blue) to histogram; supply the x-values explicitly
# so the curve starts at 0 rather than at 1
lines(0:145, fitD, lwd=3, col="blue")

# Goodness of fit with the chi-squared test
# tabulate over the full support 0:145 so values that never occur
# still get a zero cell and the length matches fitD
observed_freq <- as.vector(table(factor(data[[1]], levels=0:145)))

# perform the chi-squared test; rescale.p=TRUE because the
# probabilities truncated at 145 do not sum exactly to 1
chisq.test(observed_freq, p=fitD, rescale.p=TRUE)

Best Answer

Firstly, goodness-of-fit tests, or tests for a particular distribution, will typically reject the null hypothesis given a sufficiently large sample size, because real data hardly ever arise exactly from a particular distribution, and we rarely account for all relevant (possibly unmeasured) covariates that explain further differences between subjects/units. In practice, however, such deviations can be quite irrelevant, and it is well known that many models remain useful even when there are some deviations from their distributional assumptions (most famously the normality of residuals in regression models with normal error terms).
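This large-sample behaviour is easy to demonstrate by simulation. The sketch below (all numbers are illustrative, not related to your data) draws counts from a Poisson(5) null but contaminates 1% of them with Poisson(8) draws, then runs the same kind of chi-squared test at two sample sizes: the tiny deviation typically goes undetected at modest n but is rejected overwhelmingly at n = 10^6.

```r
# With large n, a chi-squared GOF test rejects even a practically
# negligible deviation from the null distribution.
set.seed(1)

sim_pval <- function(n) {
  # 99% Poisson(5), 1% Poisson(8): a small contamination of the null
  x <- ifelse(runif(n) < 0.01, rpois(n, 8), rpois(n, 5))
  x <- pmin(x, 12)                                  # lump everything >= 12
  obs <- as.vector(table(factor(x, levels = 0:12)))
  # null probabilities: Poisson(5), with the tail mass lumped into the last cell
  p <- dpois(0:12, 5)
  p[13] <- ppois(11, 5, lower.tail = FALSE)
  chisq.test(obs, p = p)$p.value
}

p_modest <- sim_pval(2000)   # small deviation, modest n: often not rejected
p_huge   <- sim_pval(1e6)    # same deviation, huge n: decisively rejected
```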

Secondly, a negative binomial model is a fairly logical default choice for count data (which can only be $\geq 0$). We do not have many details, though, and there may be obvious features of the data (e.g. regarding how it arises) that would suggest something more sophisticated, for instance accounting for key covariates using negative binomial regression.
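If per-object covariates were available, negative binomial regression via `MASS::glm.nb` would be one way to account for them. The following is a minimal sketch on simulated data (the predictor `x` and all parameter values are made up for illustration):

```r
# Negative binomial regression sketch: the mean count depends on a
# covariate x through a log link, with negative binomial overdispersion.
library(MASS)

set.seed(7)
n <- 1000
x <- runif(n)
# simulate counts with mean exp(1 + 2*x) and dispersion parameter 5
y <- rnbinom(n, size = 5, mu = exp(1 + 2 * x))

m <- glm.nb(y ~ x)
summary(m)   # the estimated coefficient on x should be close to 2
```

In your setting a covariate might be anything known about each random object that plausibly shifts its distance to the reference objects.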