Logistic – How to Correct Underdispersion in Logistic Regression for Accurate Generalized Linear Model

dispersion, generalized linear model, logistic, underdispersion

I fitted a logistic regression model with four continuous variables and a binary outcome. Dividing the residual deviance by the residual degrees of freedom gave 0.63. From my understanding, underdispersion is supposed to be rare, and it essentially results in coefficients that are too conservative. This seems like less of a problem than overdispersion, which results in overconfidence. Does underdispersion need to be corrected?

Best Answer

Getting a residual mean deviance around 0.63 is perfectly normal for binary regression and it does not indicate underdispersion or overdispersion. For binary regression, the residual deviance is determined entirely by the size of the fitted values and not by goodness of fit. If the fitted values are mostly near zero or one, then the deviance will be small. If the fitted values are mostly near 0.5, then the deviance will be large.
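The dependence on the fitted values alone can be checked numerically. For a logistic regression with the canonical logit link, the likelihood equations give $\sum_i y_i\hat\eta_i=\sum_i\hat\mu_i\hat\eta_i$, so the residual deviance reduces to $-2\sum_i\left[\hat\mu_i\log\hat\mu_i+(1-\hat\mu_i)\log(1-\hat\mu_i)\right]$, which does not involve $y$ once the fit is known. The sketch below is only an illustration: the four covariates, the coefficients and the seed are arbitrary assumptions, not the data from the question.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only
n <- 500
X <- matrix(rnorm(n * 4), n, 4)       # four hypothetical continuous covariates
eta <- drop(-0.5 + X %*% c(1, -1, 0.5, 0))   # assumed true linear predictor
y <- rbinom(n, size = 1, prob = plogis(eta))

fit <- glm(y ~ X, family = binomial())
mu <- fitted(fit)

# Residual deviance as reported by glm()
dev_glm <- deviance(fit)

# The same quantity computed from the fitted values alone (y is not used)
dev_mu <- -2 * sum(mu * log(mu) + (1 - mu) * log(1 - mu))

c(glm = dev_glm, from_fitted = dev_mu)
dev_glm / df.residual(fit)            # residual mean deviance
```

The two deviances should agree up to the convergence tolerance of `glm()`, which is why the ratio to the residual degrees of freedom says nothing about dispersion for binary data.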

There is actually no such thing as underdispersion or overdispersion for binary regression. It is mathematically impossible for a binary variable to have a variance other than that given by the binomial formula: because $y^2=y$ for a 0/1 variable with mean $\mu$, its variance is necessarily $\mu-\mu^2=\mu(1-\mu)$. See for example

Overdispersion or underdispersion is only possible for binomial regression when the number of cases per binomial observation is larger than one, $n>1$. Even then, you would not usually be concerned by a residual mean deviance of 0.63, which can easily be achieved even when the responses are truly binomial.
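A small simulation may help to see the difference between $n=1$ and $n>1$. In this sketch the success probability varies between observations according to a Beta(2, 2) distribution, which is an arbitrary illustrative choice, as are the seed, the sample size and $n=20$. With $n=1$ the extra variation leaves no trace in the variance of the 0/1 responses, whereas with $n=20$ it inflates the residual mean deviance well above what truly binomial data give.

```r
set.seed(1)                  # arbitrary seed
n_obs <- 500                 # number of observations
m <- 20                      # cases per binomial observation (n > 1)

# Success probabilities that vary between observations (assumed Beta(2, 2))
p_i <- rbeta(n_obs, shape1 = 2, shape2 = 2)

# Binary responses: the variance is always mean * (1 - mean)
y_bin <- rbinom(n_obs, size = 1, prob = p_i)
var_bin <- mean((y_bin - mean(y_bin))^2)
c(variance = var_bin, binomial_formula = mean(y_bin) * (1 - mean(y_bin)))

# Binomial responses with n = 20: truly binomial versus overdispersed
y_binom <- rbinom(n_obs, size = m, prob = 0.5)
y_over  <- rbinom(n_obs, size = m, prob = p_i)

mean_deviance <- function(y) {
  fit <- glm(cbind(y, m - y) ~ 1, family = binomial())
  deviance(fit) / df.residual(fit)
}

md_binom <- mean_deviance(y_binom)   # expected to be close to 1
md_over  <- mean_deviance(y_over)    # expected to be well above 1
c(binomial = md_binom, overdispersed = md_over)
```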

Here is a small example with binary data:

> y <- rbinom(1000,size=1,prob=0.005)
> fit <- glm(y~1,family=binomial())
> anova(fit)
Analysis of Deviance Table

Model: binomial, link: logit

Response: y

Terms added sequentially (first to last)


     Df Deviance Resid. Df Resid. Dev
NULL                   999     28.854
> 28/999
[1] 0.02802803

Here y has been simulated to be truly binomial, so there is no true overdispersion or underdispersion. Yet the residual mean deviance is very small, only 0.028. The small deviance is simply a consequence of the fact that the fitted values are much less than 0.5:

> summary(fitted(fit))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.002   0.002   0.002   0.002   0.002   0.002 
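The deviance in the table can be reproduced by hand from the fitted value. A fitted probability of 0.002 on 1000 observations implies that the simulated sample contained 2 successes (this is inferred from the output, not stated in it). The helper `unit_dev` is a hypothetical name introduced here to show how the deviance per observation changes with the fitted probability in an intercept-only binary model.

```r
n <- 1000
k <- 2                       # successes implied by the fitted value 0.002
p_hat <- k / n

# Residual deviance of the intercept-only binary model
dev <- -2 * (k * log(p_hat) + (n - k) * log(1 - p_hat))
round(dev, 3)                # 28.854, matching the anova table

# Deviance contributed per observation as a function of the fitted probability
unit_dev <- function(p) -2 * (p * log(p) + (1 - p) * log(1 - p))
round(unit_dev(c(0.002, 0.05, 0.2, 0.5)), 3)
```

The per-observation deviance is largest at a fitted probability of 0.5, where it equals $2\log 2\approx 1.386$, and shrinks towards zero as the fitted probability approaches 0 or 1.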

The same phenomenon exists even when the data is not binary. For example:

> y <- rbinom(1000,size=3,prob=0.005)
> p <- y/3
> fit <- glm(p~1,family=binomial(),weights=rep(3,1000))
> anova(fit)
Analysis of Deviance Table

Model: binomial, link: logit

Response: p

Terms added sequentially (first to last)


     Df Deviance Resid. Df Resid. Dev
NULL                   999     138.29

Here the data is binomial with $n=3$, with no underdispersion, yet the residual deviance is still very much smaller than the residual degrees of freedom. Again, this is caused by the fact that the fitted probabilities are closer to zero than to 0.5.

Using the deviance as a measure of goodness of fit is only reliable when the number of cases is large and the fitted probabilities are not too close to 0 or 1. There is a discussion of this phenomenon in my book with Peter Dunn on generalized linear models. We developed theoretical conditions that must be satisfied for the residual deviance to be treated as a goodness-of-fit statistic or as a measure of over- or underdispersion.
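As a rough illustration, reading "number of cases" as the binomial size per observation, the following sketch repeats the intercept-only simulation with the same small success probability and increasing sizes. The sizes and the seed are arbitrary assumptions, and this is a simulation check, not the theoretical conditions from the book.

```r
set.seed(2024)               # arbitrary seed
n_obs <- 1000
prob <- 0.005                # same small success probability as in the examples

mean_deviance <- function(size) {
  y <- rbinom(n_obs, size = size, prob = prob)
  fit <- glm(cbind(y, size - y) ~ 1, family = binomial())
  deviance(fit) / df.residual(fit)
}

sizes <- c(3, 30, 300, 3000)
md <- sapply(sizes, mean_deviance)
data.frame(size = sizes,
           expected_count = sizes * prob,
           mean_deviance = round(md, 3))
```

With truly binomial data throughout, the residual mean deviance should stay far below 1 for the small sizes and only settle near 1 once the expected count per observation is comfortably large.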

Reference

Dunn PK, Smyth GK (2018). Generalized linear models with examples in R. Springer, New York, NY.
