Solved – Causes for Underdispersion in Poisson Regression

Tags: cluster-sample, overdispersion, poisson distribution, poisson-regression, underdispersion

I am working with count data (number of pregnancies per woman) and using a Poisson GLM (log link) to model the determinants of that count.

From simple descriptives I observe that my data are overdispersed: Mean = 4.18, Variance = 7.14.
However, after fitting the Poisson GLM with the full set of controls, running dispersiontest from the R package AER gives statistically significant underdispersion: an estimated dispersion of 0.68, p-value=0.000 (testing for underdispersion with alternative = "less").

If some (relevant) controls are omitted from the model (i.e. age, dependency ratio, and dummies for provinces), the model appears to be equidispersed (0.98, p-value=0.298).

I see that underdispersion is uncommon and that there are models designed to handle it (e.g., Conway–Maxwell–Poisson regression). In fact, when I apply the latter model, the equidispersion assumption is satisfied.

However, I am concerned about why I get underdispersion when controlling for such relevant covariates.
Given that overdispersion may arise because of omitted variables, or in the presence of clustered observations, I am wondering whether, in my case, controlling for the clustered nature of the data (survey data, two-stage cluster sampling) is radically "over" reducing the variance.

Best Answer

Underdispersion might not be a surprise in this case. From a comment by @Ben Bolker: "Suppose (for example) a particular woman has a conditional expected number of pregnancies equal to 1. Density of Poisson(1) is P(0,1,2,3, ...) = {0.37,0.37,0.18,0.06, ...}. Perfectly reasonable to suppose instead that the distribution might look like {0.2,0.6,0.1,0.05,...}, i.e. more concentrated than Poisson. Women don't have children as a Poisson process! (At least not in modern societies.)"
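To make the comparison in that comment concrete, here is a minimal Python sketch (standard library only) that recomputes the first few Poisson(1) probabilities and the variance-to-mean ratio of the more concentrated distribution. The comment leaves the tail of that distribution unspecified, so the sketch assumes the remaining 0.05 of probability sits at a count of 4. That placement is purely my assumption for illustration, and the helper names are made up.

```python
import math


def poisson_pmf(k, mu):
    """P(Y = k) for Y ~ Poisson(mu)."""
    return math.exp(-mu) * mu**k / math.factorial(k)


def mean_and_variance(probs):
    """Mean and variance of a distribution on 0, 1, 2, ... given as a list of probabilities."""
    mean = sum(k * p for k, p in enumerate(probs))
    second_moment = sum(k * k * p for k, p in enumerate(probs))
    return mean, second_moment - mean**2


# First four Poisson(1) probabilities, rounded as in the comment.
poisson_first_four = [round(poisson_pmf(k, 1.0), 2) for k in range(4)]
print(poisson_first_four)

# ASSUMPTION: the quoted distribution is truncated ("..."); the leftover
# 0.05 of probability mass is placed at k = 4 only so the list sums to 1.
concentrated = [0.2, 0.6, 0.1, 0.05, 0.05]
mean, variance = mean_and_variance(concentrated)
ratio = variance / mean
print(mean, variance, ratio)
```

Under the assumed tail the ratio comes out at roughly 0.81, i.e. below the value of 1 that a Poisson distribution would give. A different tail would change the number, so treat it as an illustration of "more concentrated than Poisson", not as an estimate.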

As for the extra question in a comment: "However, my concern regards the fact that underdispersion in the Poisson regression is coming from an originally overdispersed dependent variable." But if the conditional mean varies a lot across the dataset, then, marginally, the variance will be higher than the mean, even if the Poisson assumption is fulfilled in the conditional distributions. So talking about an "originally overdispersed dependent variable" does not make much sense. Over/underdispersion must be judged after all obviously relevant variables (like groups/clusters) are taken into account.
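The marginal-versus-conditional point is the law of total variance: Var(Y) = E[Var(Y|X)] + Var(E[Y|X]). If Y given X is exactly Poisson with mean mu(X), this becomes E[mu] + Var(mu), which exceeds the marginal mean E[mu] whenever mu varies. A small Python sketch (standard library only) checks this by direct summation for a mixture of three Poisson distributions. The group means 2, 4 and 6 and the equal weights are hypothetical values picked for illustration, not taken from the question's data.

```python
import math


def poisson_pmf(k, mu):
    """P(Y = k) for Y ~ Poisson(mu), computed on the log scale for stability."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def mixture_moments(means, weights, kmax=100):
    """Marginal mean and variance of a finite mixture of Poisson distributions.

    Sums the mixture pmf over 0..kmax; kmax must be large enough that the
    omitted tail is negligible for the means used.
    """
    mean = 0.0
    second_moment = 0.0
    for k in range(kmax + 1):
        p = sum(w * poisson_pmf(k, m) for m, w in zip(means, weights))
        mean += k * p
        second_moment += k * k * p
    return mean, second_moment - mean**2


# HYPOTHETICAL groups: each is exactly Poisson (equidispersed) given its mean.
group_means = [2.0, 4.0, 6.0]
group_weights = [1 / 3, 1 / 3, 1 / 3]

marginal_mean, marginal_var = mixture_moments(group_means, group_weights)

# Law of total variance: E[mu] + Var(mu).
expected_mu = sum(w * m for m, w in zip(group_means, group_weights))
var_mu = sum(w * (m - expected_mu) ** 2 for m, w in zip(group_means, group_weights))

print(marginal_mean, marginal_var, expected_mu + var_mu)
```

Every group here has variance equal to its mean, yet the pooled counts have a variance well above the pooled mean. That is why raw descriptives of the response say little about dispersion relative to a fitted model.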
