Solved – Prediction from Zero-Inflated Models

generalized-additive-model, prediction, zero-inflation

I have fitted a two-stage generalized additive model to zero-inflated data (a "hurdle model"). I am modeling spatial fishery data (catch rates) as a function of different environmental variables (temperature, chlorophyll, etc.). I would like to predict fish abundances, but I'm somewhat lost about how to proceed.

I have used GAMs and linear models before, so I'm familiar with the predict functions in R. However, I now have two models for a single prediction (fish abundance given the probability of occurrence, i.e. the prediction of the count model combined with the prediction of the binomial model). My first approach was to multiply the predictions, but that would assume that both events are independent, which is not the case. I also explored conditional probability, but without much success.

Best Answer

This is a nice question.

You don't specify which R packages and functions you use, so I'll take the excellent vignette for the pscl package and walk through that. Specifically, I'll start from the description of a hurdle model in its Section 3.5.

As described in the vignette, you can download the example data file (DebTrivedi.rda) and save it locally. Then we can load it, extract the columns we want, load the package, and fit a hurdle model:

load("DebTrivedi.rda")
dt <- DebTrivedi[, c(1, 6:8, 13, 15, 18)]
require(pscl)
fm_hurdle0 <- hurdle(ofp ~ ., data = dt, dist = "negbin")
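
Note that hurdle() fits both parts of the model, the truncated count component and the zero hurdle, in a single call, so the fitted object already bundles the two "stages" you fitted separately; the summary shows both sets of coefficients:

summary(fm_hurdle0)   # reports count-model and zero-hurdle coefficients separately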

So, how do we predict from fm_hurdle0? First, we have to specify the scenario we want to predict for. As always with predict functions, this means filling a data.frame with the corresponding regressor values:

newdata <- data.frame(hosp = 1, health = "average", numchron = 2,
    gender = "male", school = 10, privins = "yes")

Now, we can look at the specific predict method for objects of class hurdle like this:

?predict.hurdle

The key argument is type. By default, it is "response", i.e., the conditional mean of the response given the covariates:

predict(fm_hurdle0, newdata=newdata)
       1 
6.682801

Thus, our point prediction for this new instance is 6.68.

However, I'd argue that the point prediction is not everything. After all, we are fitting a hurdle model precisely because the distribution is "nonstandard": we have zero inflation. So we should try to get an idea of the overall shape of the likely response, which means we really want the full predictive density. Happily, we can get that using type="prob":

predictive.density <- predict(fm_hurdle0, type="prob",newdata=newdata)
predictive.density
          0          1         2          3          4          5 ...
1 0.0876322 0.09924771 0.0983676 0.09212974 0.08377478 0.07480663 ...

Let's visualize that, including the predicted mean as a vertical line:

plot(as.numeric(colnames(predictive.density)), predictive.density,
    col = "grey", type = "h", lwd = 3,               # one spike per possible count
    xlab = "Outcome", ylab = "Predicted probability")
abline(v = predict(fm_hurdle0, newdata = newdata))   # predicted mean

[Figure: the predictive density for the new observation, with the predicted mean marked by a vertical line]

As a double check, we can recover the predicted mean from the predictive density (up to numerical accuracy):

sum(as.numeric(colnames(predictive.density))*predictive.density)
[1] 6.682788

You can also have predict.hurdle give you

  • the predicted mean of the count component (without the zero hurdle) with type="count", and
  • the predicted ratio of probabilities for observing a non-zero count with type="zero", which is a somewhat complicated beast; see the help page ?predict.hurdle on this and the small consistency check below.
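
These pieces fit together: the hurdle mean is the count-component mean multiplied by that probability ratio, so, as a quick sketch with the objects from above, the product should reproduce the type="response" prediction (up to rounding):

mu_count   <- predict(fm_hurdle0, newdata = newdata, type = "count")
zero_ratio <- predict(fm_hurdle0, newdata = newdata, type = "zero")
mu_count * zero_ratio   # should agree with predict(fm_hurdle0, newdata = newdata), about 6.68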

If you use a different function (for instance two separate GAMs, as in your two-stage setup), this may give you some ideas but may not be enough. You may then want to post a follow-up question focusing on the R aspects on Stack Overflow under the r tag.
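
That said, if you stay with two separate GAMs, the analogue of the hurdle mean is the law of total expectation: the expected catch equals the probability of a non-zero catch times the conditional mean of the positive catches, so the two predict() calls are simply multiplied; no independence assumption between the stages is needed beyond conditioning on the same covariates. A minimal sketch, assuming mgcv and hypothetical data and variable names (fish, newgrid, temp, chl, present, catch):

require(mgcv)

## hypothetical data set 'fish' with columns present (0/1), catch, temp, chl
## Stage 1: probability of a non-zero catch (presence/absence, binomial GAM)
m_bin <- gam(present ~ s(temp) + s(chl), family = binomial, data = fish)
## Stage 2: mean catch given that something was caught (positive records only)
m_pos <- gam(catch ~ s(temp) + s(chl), family = Gamma(link = "log"),
    data = subset(fish, catch > 0))

## Expected catch = P(catch > 0) * E(catch | catch > 0)
p_occ  <- predict(m_bin, newdata = newgrid, type = "response")
mu_pos <- predict(m_pos, newdata = newgrid, type = "response")
expected_catch <- p_occ * mu_pos

The key point is that the second-stage prediction must be the mean conditional on a positive catch, which it is when that model is fitted to the positive records only.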

Finally, as I wrote, I strongly recommend looking not only at the predicted mean, but at the entire predictive density as illustrated above. If you want to check how good your predictive density is and you have some holdout data, you can use discrete proper scoring rules to check the calibration and sharpness of your predictive density. A good place to start is Wei & Held (2014), "Calibration Tests for Count Data", TEST, 23, 787-805.
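
As a rough sketch of how such a check could look in practice, here is an average logarithmic score computed on a random holdout split of the data from above (the split, the choice of score, and the use of predict.hurdle's at argument to set the grid of counts are my own illustrative assumptions, not taken from the paper):

## Sketch: average logarithmic score on holdout data (smaller is better)
set.seed(1)
test_idx <- sample(nrow(dt), 500)
fit_cv <- hurdle(ofp ~ ., data = dt[-test_idx, ], dist = "negbin")

## predictive probabilities for every holdout case at the counts 0, 1, 2, ...
pp <- predict(fit_cv, newdata = dt[test_idx, ], type = "prob", at = 0:max(dt$ofp))
## probability each prediction assigned to the count actually observed
obs_prob <- pp[cbind(seq_len(nrow(pp)), dt$ofp[test_idx] + 1)]
mean(-log(obs_prob))   # compare this value across candidate models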
