Zero-Inflation – Can Models for Non-Negative Data Predict Exact Zeros?

generalized linear model, prediction, r, tweedie-distribution, zero inflation

A Tweedie distribution can model skewed data with a point mass at zero when the parameter $p$ (exponent in the mean-variance relationship) is between 1 and 2.

Similarly, a zero-inflated model (whether the rest of the distribution is continuous or discrete) can accommodate a large number of zeros.

I'm having trouble understanding why, when I predict or compute fitted values with these kinds of models, all of the predicted values are non-zero.

Can these models actually predict exact zeros?

For example

library(tweedie)
library(statmod)
# generate data
y <- rtweedie(100, xi = 1.3, mu = 1, phi = 1)  # xi = p
x <- y + rnorm(length(y), 0, 0.2)
# estimate p by profile likelihood
out <- tweedie.profile(y ~ 1, p.vec = seq(1.1, 1.9, length = 9))
# fit GLM with a log link (link.power = 0)
fit <- glm(y ~ x, family = tweedie(var.power = out$p.max, link.power = 0))
# predict the mean response
pred <- predict(fit, newdata = data.frame(x = x), type = "response")

pred now does not contain any zeros.
I thought the usefulness of models such as the Tweedie distribution comes from its ability to predict exact zeros and the continuous part.

I know that in my example the variable x is not very predictive.

Best Answer

Note that the predicted value in a GLM is a mean.

For any distribution on non-negative values, to predict a mean of 0, its distribution would have to be entirely a spike at 0.
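The reasoning behind this can be written out by splitting the mean at zero. A short sketch for a non-negative random variable $Y$ (the notation is mine, not from the question):

$$
E[Y] \;=\; P(Y=0)\cdot 0 \;+\; P(Y>0)\,E[Y \mid Y>0] \;=\; P(Y>0)\,E[Y \mid Y>0].
$$

Whenever $P(Y>0) > 0$, the conditional mean $E[Y \mid Y>0]$ is strictly positive, so $E[Y] > 0$. Hence $E[Y] = 0$ only if $P(Y=0) = 1$.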

However, with a log-link, you're never going to fit a mean of exactly zero (since that would require $\eta$ to go to $-\infty$).

So your problem isn't specific to the Tweedie but is far more general; you'd have exactly the same issue with a Poisson GLM (zero-inflated or ordinary), for example, or with a binomial, a 0-1 inflated beta, and indeed any other distribution on non-negative values.
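To illustrate the Poisson case, here is a minimal sketch on simulated data (the coefficients, sample size and variable names are arbitrary choices of mine, not anything from the question). The data contain many exact zeros, yet every fitted mean is strictly positive. What the model does give you is a fitted probability of a zero, $\exp(-\hat\mu)$ for the Poisson.

```r
# Minimal sketch on simulated data; all names and values are illustrative.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rpois(n, lambda = exp(-1 + 0.5 * x))  # small means, so many exact zeros

fit <- glm(y ~ x, family = poisson(link = "log"))
mu_hat <- fitted(fit)                      # fitted means, exp(eta) on the log link

prop_zero_obs <- mean(y == 0)              # share of exact zeros in the data
min_fitted    <- min(mu_hat)               # smallest fitted mean (positive)
p0_hat        <- exp(-mu_hat)              # fitted P(Y = 0) for each observation

cat("observed share of zeros:", prop_zero_obs, "\n")
cat("smallest fitted mean:   ", min_fitted, "\n")
cat("mean fitted P(Y = 0):   ", mean(p0_hat), "\n")
```

Comparing the observed share of zeros with the average fitted $P(Y=0)$ is one rough way to see whether the model accounts for the zeros, even though no fitted mean is zero.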

> I thought the usefulness of the Tweedie distribution comes from its ability to predict exact zeros and the continuous part.

Since predicting exact zeros isn't going to occur for any distribution over non-negative values with a log-link, your thinking on this must be mistaken.

One of its attractions is that it can model exact zeros in the data, not that the mean predictions will be 0. [Of course a fitted distribution with nonzero mean can still assign positive probability to exactly zero, even though the mean must exceed 0. A suitable prediction interval could well include 0, for example.]
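To make the probability of an exact zero concrete: for $1 < p < 2$ the Tweedie is a compound Poisson-gamma distribution, so $P(Y=0) = \exp(-\lambda)$ with $\lambda = \mu^{2-p} / (\phi\,(2-p))$. The sketch below evaluates this in base R. The helper name `tweedie_p0` is my own and is not a function from the tweedie or statmod packages; the parameter values are the ones used to simulate the data in the question.

```r
# Hypothetical helper (not part of the tweedie or statmod packages):
# P(Y = 0) for a Tweedie distribution with 1 < p < 2, using the
# compound Poisson-gamma representation, where the number of gamma
# summands is Poisson with rate lambda = mu^(2 - p) / (phi * (2 - p)).
tweedie_p0 <- function(mu, phi, p) {
  stopifnot(all(mu > 0), all(phi > 0), all(p > 1 & p < 2))
  lambda <- mu^(2 - p) / (phi * (2 - p))
  exp(-lambda)
}

# Values used to simulate the data in the question: mu = 1, phi = 1, p = 1.3
p0 <- tweedie_p0(mu = 1, phi = 1, p = 1.3)
print(p0)

# Smaller means push the zero probability toward 1,
# but the mean itself never reaches 0.
p0_grid <- tweedie_p0(mu = c(0.01, 0.1, 1, 10), phi = 1, p = 1.3)
print(p0_grid)
```

With a fitted model, one could plug in `fitted(fit)` for `mu`, the estimated power for `p`, and a dispersion estimate such as `summary(fit)$dispersion` for `phi`. Treat that choice of dispersion estimate as an assumption rather than the only option.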

It matters not at all that the fitted distribution includes any substantial proportion of zeros - that doesn't make the fitted mean zero (except in the limit as you go to all zeros).

Note that changing your link function to, say, an identity link doesn't really solve your problem: the mean of a non-negative random variable that's not all-zeros will be positive.