Solved – Measure of “deviance” for zero-inflated Poisson or zero-inflated negative binomial

deviance, goodness of fit, zero inflation

Scaled deviance, defined as D = 2 * (log-likelihood of saturated model minus log-likelihood of fitted model), is often used as a measure of goodness-of-fit in GLM models. Percent deviance explained, defined as [D(null model) – D(fitted model)] / D(null model), is also sometimes used as the GLM analog to linear regression's R-squared. Aside from the fact that ZIP and ZINB distributions are not part of the exponential family of distributions, I'm having trouble understanding why scaled deviance and percent deviance explained are not used in zero-inflated modeling. Can anyone shed some light on this or provide helpful references? Thanks in advance!
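As a concrete illustration of the two quantities in the question, here is a minimal R sketch for an ordinary Poisson GLM, where the dispersion is fixed at 1 so scaled and unscaled deviance coincide. The simulated data and the names `dat`, `fit`, `d_fit`, `d_null`, and `pct_dev` are assumptions made for the example, not part of the original post.

```r
set.seed(1)
# Hypothetical data: one covariate, Poisson response
dat <- data.frame(x = rnorm(200))
dat$y <- rpois(200, lambda = exp(0.5 + 0.8 * dat$x))

fit <- glm(y ~ x, family = poisson, data = dat)

# Residual deviance D(fitted) and null deviance D(null), as stored by glm()
d_fit  <- deviance(fit)
d_null <- fit$null.deviance

# Percent deviance explained: [D(null) - D(fitted)] / D(null)
pct_dev <- (d_null - d_fit) / d_null
pct_dev
```

Both deviances are measured against the same saturated model, which is exactly the ingredient that is hard to define for ZIP and ZINB.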

Best Answer

Deviance is a GLM concept. ZIP and ZINB models are not GLMs; they are formulated as finite mixtures of distributions that are themselves GLMs, which is why they can be fitted easily with the EM algorithm.

These notes describe the theory of deviance concisely. If you read those notes you'll see the proof that the saturated model for the Poisson regression has log-likelihood

$$\ell(\lambda_s)= \sum_{i:\, y_i\neq 0} \left[ y_i\log(y_i)-y_i -\log(y_i!)\right]$$

which results from the plug-in estimates $\hat{\lambda}_i = y_i$.
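The saturated log-likelihood above can be checked numerically. The following is a small sketch on simulated data (the data and the names `ll_sat`, `ll_sat_check`, and `d_manual` are assumptions for illustration). It evaluates the formula directly, compares it with the Poisson density evaluated at $\lambda_i = y_i$, and confirms that twice the gap to the fitted log-likelihood reproduces the deviance reported by `glm`.

```r
set.seed(2)
x <- rnorm(100)
y <- rpois(100, lambda = exp(1 + 0.5 * x))
fit <- glm(y ~ x, family = poisson)

# Saturated log-likelihood from the formula; terms with y_i = 0 contribute 0
pos <- y > 0
ll_sat <- sum(y[pos] * log(y[pos]) - y[pos] - lgamma(y[pos] + 1))

# Same quantity via the Poisson log-density evaluated at lambda_i = y_i
ll_sat_check <- sum(dpois(y, lambda = y, log = TRUE))

# Deviance as twice the gap between saturated and fitted log-likelihoods
d_manual <- 2 * (ll_sat - as.numeric(logLik(fit)))
all.equal(d_manual, deviance(fit))
```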

I'll proceed now with the ZIP likelihood because the math is simpler; similar results hold for the ZINB. Unfortunately, for the ZIP there is no simple relationship like the one for the Poisson. The $i$th observation's log-likelihood is

$$\ell_i(\phi, \lambda)=Z_i\log(\phi+(1-\phi)e^{-\lambda})+ (1-Z_i)\left[-\lambda +y_i\log(\lambda) -\log(y_i!)\right].$$

The $Z_i$ are not observed, so to solve this you would need to take partial derivatives with respect to both $\lambda$ and $\phi$, set them to zero, and solve for $\lambda$ and $\phi$. The difficulty lies in the $y_i=0$ values: each one can be attributed to $\hat{\lambda}$ or to $\hat{\phi}$, and without observing $Z_i$ it is not possible to tell which. However, if we knew the $Z_i$ values we wouldn't need a ZIP model, because there would be no missing data. The observed data together with the unobserved $Z_i$ make up the "complete data" of the EM formalism.
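For reference, marginalizing over the unobserved $Z_i$ gives the observed-data ZIP probabilities below. This is the standard ZIP parameterization, assumed here to match the answer's $\phi$ (probability of a structural zero) and $\lambda$ (Poisson mean). It makes the ambiguity explicit: a zero count receives probability mass from both components, while a positive count can only come from the Poisson part.

$$P(Y_i = 0) = \phi + (1-\phi)e^{-\lambda}, \qquad P(Y_i = y) = (1-\phi)\frac{e^{-\lambda}\lambda^{y}}{y!}, \quad y = 1, 2, \ldots$$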

One approach that might be reasonable is to work with the expectation of the complete-data log-likelihood with respect to $Z_i$, $\mathbb{E}(\ell_i(\phi, \lambda))$, which replaces each $Z_i$ with its expectation. This is part of what the EM algorithm calculates (the E step) using the most recent parameter updates. I'm unaware of any literature that has studied this approach to an expected deviance, though.
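To make the E step concrete, here is a minimal EM sketch for an intercept-only ZIP model in R. It assumes the standard complete-data log-likelihood $Z_i\log\phi + (1-Z_i)\left[\log(1-\phi) - \lambda + y_i\log\lambda - \log(y_i!)\right]$, which is not written out in this form in the answer, and it uses simulated data with hypothetical true values. The names `tau`, `phi`, and `lambda` are illustrative.

```r
set.seed(3)
n <- 2000
# Hypothetical truth: 30% structural zeros, Poisson mean 2.5
z <- rbinom(n, 1, 0.3)
y <- ifelse(z == 1, 0, rpois(n, 2.5))

phi <- 0.5; lambda <- mean(y)   # crude starting values
for (iter in 1:200) {
  # E step: posterior probability that an observed zero is a structural zero;
  # positive counts can only come from the Poisson component, so tau = 0 there
  tau <- ifelse(y == 0, phi / (phi + (1 - phi) * exp(-lambda)), 0)
  # M step: closed-form updates from the expected complete-data log-likelihood
  phi    <- mean(tau)
  lambda <- sum((1 - tau) * y) / sum(1 - tau)
}
c(phi = phi, lambda = lambda)
```

The `tau` values are the expectations that replace the $Z_i$. Any "expected deviance" built this way would still need a saturated counterpart, which is the piece that has no obvious definition.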

This question was asked first, so I answered here. However, there is another question on the same topic with a nice comment by Gordon Smyth: "deviance for zero-inflated compound poisson model, continuous data (R)". He gives essentially the same response (this answer is an elaboration of that comment, I'd say), and the comments on that post mention a paper that you may want to read. (Disclaimer: I have not read the paper referenced.)

Related Solutions

Zero-Inflated Poisson Regression – When to Use Zero-Inflated Poisson Regression and Negative Binomial Distribution

I suspect that your problem may be that the default behavior of predict.glm isn't what you think it is.

Specifically, predict used on a glm object will by default return predictions on the scale of the linear predictor, not on the scale of the response.

This is quite clearly stated in the help (?predict.glm) but seems to trip people up very often (suggesting the default ought to be changed, perhaps; you might like to raise it on the relevant mailing list).

To get the values you want, try predict(model1,type="response")
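The difference between the two scales can be seen in a short sketch. The simulated data frame `d` is an assumption for illustration; `model1` is reused as a name from the suggestion above but is fitted here as a plain Poisson GLM.

```r
set.seed(4)
d <- data.frame(x = runif(50))
d$y <- rpois(50, lambda = exp(1 + d$x))
model1 <- glm(y ~ x, family = poisson, data = d)

eta <- predict(model1)                     # default type = "link": log scale
mu  <- predict(model1, type = "response")  # expected counts

# With the log link, the response scale is exp() of the linear predictor
all.equal(exp(eta), mu)
```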

Logistic Regression – Why Deviance is Not Equal to -2*logLik in R

Sorry for answering my own question, but I eventually found my erroneous assumption.

These are grouped data, or, in other words: the same predictor values occur more than once. In this case, even a perfect ("saturated") model cannot predict the response correctly, and the probabilities for the outcome are different from one, thereby resulting in a "saturated deviance" different from zero.

The "saturated model" is therefore the model with $P(Y=1|X=x_i)=k_i/n_i$, where $n_i$ is the number of samples with $X=x_i$ and $k_i$ is the number of $Y=1$ among these. The log-likelihood function of the saturated model is thus $$\ell_s = \sum_{i=1}^n \log\left[{n_i \choose k_i} p_i^{k_i}(1-p_i)^{n_i-k_i} \right] \quad\mbox{with}\quad p_i=\frac{k_i}{n_i}$$

Using this expression for computing the "saturated deviance" yields the result reported by glm:

> library(MASS)
> fit <- glm(cbind(Menarche, Total - Menarche) ~ Age, binomial, data=menarche)
> n <- menarche$Total
> k <- menarche$Menarche
> LL.s <- 0
> for (i in 1:length(n)) {
+  LL.s <- LL.s + log(dbinom(k[i], n[i], k[i]/n[i]))
+ }
> as.numeric(-2*logLik(fit) + 2*LL.s)
[1] 26.70345
> fit$deviance
[1] 26.70345
