Fitting a probability distribution to zero-inflated data in R

Tags: distributions, probability, r, zero-inflation

I am trying to learn how to fit a probability distribution to a vector of data, using the program R, but there are a lot of potential probability distributions to use! So my question is, how do I find the best distribution for my data, and how do I prove that I have picked the right distribution? Can I acquire AIC values for a whole set of different distributions?

The data are observational count data of bees visiting flowers. Each species has a certain number of visits, hence the differing frequencies. The goal is to find the best distribution to describe the bee visitation, show that I have selected the right one, and then use that distribution to sample from randomly for a set of simulations.

Here is what the data look like: a vector of count observations. It is zero-inflated, with a long-tailed distribution (maybe zero-inflated negative binomial?).

i.vec=c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,0,0,
0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)

And here are some basic summary statistics that I have calculated. I am using the standard deviation for sigma (sig), and phi is the proportion of zeros in the data (zero.prop).

m <- mean(i.vec)                # sample mean
#[1] 3.040816
sig <- sd(i.vec)                # sample standard deviation
#[1] 10.86078
zero.prop <- mean(i.vec == 0)   # proportion of zeros
#[1] 0.6122449

As you can see, the standard deviation is much greater than the mean, and I have a very high proportion of zeroes.
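The comparison described above can be made explicit with a quick base R check. This is only a rough sketch: it sets the variance-to-mean ratio and the observed share of zeros against what a plain Poisson with the same mean would imply. The names `dispersion`, `zero.obs` and `zero.pois` are introduced here for illustration and are not part of the original question.

```r
# Bee-visit counts from the question
i.vec <- c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,0,0,
           0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)

m <- mean(i.vec)
dispersion <- var(i.vec) / m        # close to 1 for Poisson-like data
zero.obs   <- mean(i.vec == 0)      # observed proportion of zeros
zero.pois  <- dpois(0, lambda = m)  # proportion of zeros a Poisson(m) would imply

round(c(mean = m, var.to.mean = dispersion,
        zeros.observed = zero.obs, zeros.poisson = zero.pois), 3)
```

A variance-to-mean ratio far above 1, together with many more zeros than the Poisson benchmark, is the usual motivation for trying negative binomial or zero-inflated models.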

Best Answer

You can use the Vuong test in the pscl package to compare non-nested models. Here is an example (run library(pscl) first so that zeroinfl() and vuong() are available):

> m1 <- zeroinfl(i.vec ~ 1 | 1, dist = "negbin")
> summary(m1)

Call:
zeroinfl(formula = i.vec ~ 1 | 1, dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.3730 -0.3730 -0.3730 -0.2503  7.3544 

Count model coefficients (negbin with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.1122     0.3831   2.903  0.00369 ** 
Log(theta)   -1.9256     0.2839  -6.784 1.17e-11 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -9.815     96.462  -0.102    0.919
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 0.1458 
Number of iterations in BFGS optimization: 579 
Log-likelihood: -80.51 on 3 Df


> m2 <- zeroinfl(i.vec ~ 1 | 1, dist = "poisson")
> summary(m2)

Call:
zeroinfl(formula = i.vec ~ 1 | 1, dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.7242 -0.7242 -0.7242 -0.4860 14.2795 

Count model coefficients (poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.05911    0.08205    25.1   <2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.4561     0.2933   1.555     0.12
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 11 
Log-likelihood: -233.7 on 2 Df


> vuong(m1, m2)
Vuong Non-Nested Hypothesis Test-Statistic: 1.946095 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
in this case:
model1 > model2, with p-value 0.02582165
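The question also asks about AIC values for a whole set of distributions. One possible base R sketch is shown here. It is not how pscl fits its models: the likelihoods are hand-written and maximised with `optim()`, and the helper `ll_zi`, the list names and the starting values are all assumptions made for illustration. With the pscl fits themselves, `AIC(m1, m2)` is expected to return the corresponding values for those two models, though that is worth verifying against the installed version.

```r
i.vec <- c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,0,0,
           0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)

# Log-likelihood terms of a zero-inflated count model: a zero is either a
# structural zero (probability pi0) or a zero from the count density.
ll_zi <- function(y, pi0, logf) {
  ifelse(y == 0, log(pi0 + (1 - pi0) * exp(logf)), log1p(-pi0) + logf)
}

# Negative log-likelihoods on unconstrained scales:
# p[1] = log mean, then log size and/or logit of the zero-inflation probability.
nll <- list(
  Poisson = function(p, y) -sum(dpois(y, exp(p[1]), log = TRUE)),
  NB      = function(p, y) -sum(dnbinom(y, mu = exp(p[1]), size = exp(p[2]), log = TRUE)),
  ZIP     = function(p, y) -sum(ll_zi(y, plogis(p[2]), dpois(y, exp(p[1]), log = TRUE))),
  ZINB    = function(p, y) -sum(ll_zi(y, plogis(p[3]),
                                      dnbinom(y, mu = exp(p[1]), size = exp(p[2]), log = TRUE)))
)

# Starting values: log of the sample mean, size = 1, zero-inflation = 0.5
start <- list(Poisson = log(mean(i.vec)),
              NB      = c(log(mean(i.vec)), 0),
              ZIP     = c(log(mean(i.vec)), 0),
              ZINB    = c(log(mean(i.vec)), 0, 0))

fits <- Map(function(f, s) optim(s, f, y = i.vec, method = "BFGS"), nll, start)

k   <- lengths(start)                                   # parameters per model
ll  <- -vapply(fits, function(f) f$value, numeric(1))   # maximised log-likelihoods
aic <- 2 * k - 2 * ll

data.frame(df = k, logLik = round(ll, 2), AIC = round(aic, 2))[order(aic), ]
```

Lower AIC is better, but AIC only ranks the candidates that were tried; it does not prove that any of them is the true distribution. Note also that the zero-inflation intercept in summary(m1) (-9.815, standard error 96.462) corresponds to an estimated structural-zero probability very close to 0, so the plain negative binomial row is worth a careful look.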

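In the vuong(m1, m2) output, the test favors the zero-inflated negative binomial (m1) over the zero-inflated Poisson (m2), with p ≈ 0.026. The Vuong test also suggests that the zero-inflated negative binomial fits your data better than the ordinary negative binomial (not shown here, but you can fit both models and compare them).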
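For the last step in the question, sampling from the chosen distribution, a minimal sketch follows. It assumes the zero-inflated negative binomial is the model kept, and it plugs in the estimates printed by summary(m1). The helper `rzinb` is hypothetical (it is not a pscl function), and the seed is arbitrary.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Estimates copied from the summary(m1) output
mu    <- exp(1.1122)      # count-model intercept is on the log scale
theta <- 0.1458           # negative binomial size (dispersion) parameter
pi0   <- plogis(-9.815)   # zero-inflation intercept is on the logit scale

# Hypothetical helper: draw n values from a zero-inflated negative binomial
rzinb <- function(n, mu, theta, pi0) {
  structural.zero <- rbinom(n, size = 1, prob = pi0)  # 1 = forced zero
  counts <- rnbinom(n, mu = mu, size = theta)         # ordinary NB draws
  counts * (1 - structural.zero)
}

sim <- rzinb(10000, mu, theta, pi0)

# Zero probability implied by the model, next to the simulated and observed shares
p.zero <- pi0 + (1 - pi0) * dnbinom(0, mu = mu, size = theta)
c(simulated = mean(sim == 0), implied = p.zero, observed = 30 / 49)
```

Comparing the simulated share of zeros (and, for example, the upper quantiles) with the observed data is a simple sanity check before using the draws in further simulations.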

Related Solutions

How to test/prove data is zero inflated

This seems like a relatively straightforward (nonlinear) mixed model to me. You have seed pods nested within clusters nested within plants, and you can fit a binomial model with random effects at each level:

    library(lme4)
    # glmer() fits generalized linear mixed models; lmer() no longer accepts a family argument
    binre <- glmer(pollinated ~ 1 + (1 | plant) + (1 | cluster), data = my.data, family = binomial)

or with covariates if you have them. If the flowers self-pollinate, then you might see some mild effects due to natural variability in how viable the plants are by themselves. However, if most of the variability in the response is driven by, say, cluster variability, you would have stronger evidence of pollination by insects that might visit only selected clusters on a plant. Ideally, you would want a non-parametric distribution of the random effects rather than a Gaussian one: a point mass at zero, for no insect visits, and a point mass at a positive value. This is essentially the mixture model Michael Chernick thought about. You can fit this with the GLLAMM package for Stata; I'd be surprised if this were not possible in R.

Probably for a clean experiment, you would want to have the plants inside, or at least in a location with no insect access, and see how many seeds would be pollinated. That would probably answer all your questions in a more methodologically rigorous way.

Trouble finding good model fit for count data with mixed effects – ZINB or something else

This post is four years old, but I wanted to follow up on what fsociety said in a comment. Diagnosing residuals in GLMMs is not straightforward, since standard residual plots can show non-normality, heteroscedasticity, etc., even if the model is correctly specified. There is an R package, DHARMa, specifically suited to diagnosing this type of model.

The package is based on a simulation approach to generate scaled residuals from fitted generalized linear mixed models and generates different easily interpretable diagnostic plots. Here is a small example with the data from the original post and the first fitted model (m1):

library(DHARMa)
sim_residuals <- simulateResiduals(m1, n = 1000)  # 1000 simulations per observation
plot(sim_residuals)  # plotSimulatedResiduals() is deprecated in current DHARMa releases

The plot on the left shows a QQ plot of the scaled residuals to detect deviations from the expected distribution, and the plot on the right represents residuals vs predicted values while performing quantile regression to detect deviations from uniformity (red lines should be horizontal and at 0.25, 0.50 and 0.75).
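The idea behind simulation-based scaled residuals can be sketched in base R. This is not DHARMa's actual implementation: it uses toy Poisson data with a known mean as a stand-in for a fitted model, and every name in it (`lambda`, `sims`, `scaled.res`) is hypothetical.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n <- 200
lambda <- 4              # hypothetical "fitted" Poisson mean
y <- rpois(n, lambda)    # toy data generated from that same model

# Simulate 1000 replicate responses per observation from the fitted model
n.sim <- 1000
sims <- matrix(rpois(n * n.sim, lambda), nrow = n)

# Scaled residual: share of simulations below the observation, plus a
# uniformly random share of the ties (needed because counts are discrete)
below <- rowMeans(sims < y)
ties  <- rowMeans(sims == y)
scaled.res <- below + runif(n) * ties

# If the model is correct, these should look uniform on [0, 1]
qqplot(ppoints(n), scaled.res,
       xlab = "Expected (uniform)", ylab = "Scaled residual")
abline(0, 1)
```

Because the toy data come from the same model that generates the simulations, the points should fall close to the diagonal; a misspecified model would bend away from it.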

Additionally, the package also has specific functions for testing for over/underdispersion and zero inflation, among others:

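# Deprecated in current DHARMa releases; testDispersion(sim_residuals) is the suggested replacement
testOverdispersionParametric(m1)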

Chisq test for overdispersion in GLMMs

data:  poisson
dispersion = 0.18926, pearSS = 11.35600, rdf = 60.00000, p-value = 1
alternative hypothesis: true dispersion greater 1

testZeroInflation(sim_residuals)

DHARMa zero-inflation test via comparison to expected zeros with 
simulation under H0 = fitted model


data:  sim_residuals
ratioObsExp = 0.98894, p-value = 0.502
alternative hypothesis: more
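The logic of a zero-inflation test of this kind, comparing observed zeros with zeros simulated under the fitted model, can be illustrated in base R. The sketch below is not DHARMa's code. As an assumption for illustration, it uses the bee-visit counts from the first question and a plain Poisson as the null model, with an arbitrary seed.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Bee-visit counts from the first question
y <- c(0,63,1,4,1,44,2,2,1,0,1,0,0,0,0,1,0,0,3,0,0,2,0,0,0,0,0,2,0,0,0,0,
       0,0,0,0,0,0,0,0,6,1,11,1,1,0,0,0,2)

lambda.hat <- mean(y)        # Poisson maximum likelihood estimate
obs.zeros  <- sum(y == 0)    # zeros actually observed

# Number of zeros in each of 1000 data sets simulated under the null model
sim.zeros <- replicate(1000, sum(rpois(length(y), lambda.hat) == 0))

ratioObsExp <- obs.zeros / mean(sim.zeros)
# One-sided p-value ("more zeros than expected"), with the usual +1 correction
p.value <- (sum(sim.zeros >= obs.zeros) + 1) / (length(sim.zeros) + 1)

c(ratioObsExp = ratioObsExp, p.value = p.value)
```

A ratio well above 1 with a small p-value says the null model produces too few zeros. That alone does not require a zero-inflated model, since an overdispersed count distribution such as the negative binomial can also generate many zeros.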
