Solved – Prediction from Zero-Inflated Models

generalized-additive-model, prediction, zero-inflation

I have fitted a two-stage generalized additive model to zero-inflated data (a "hurdle model"). I am modeling spatial fishery data (catch rates) as a function of different environmental variables (temperature, chlorophyll, etc.). I would like to predict fish abundances, but I'm somewhat lost about how to proceed.

I have used GAMs and linear models before, so I'm familiar with the predict functions in R. However, I now have two models for a single prediction (fish abundance given the probability of occurrence, i.e. the prediction of the count model combined with the prediction of the binomial model). My first approach was to multiply the predictions, but that would assume that both events are independent, which is not the case. I also explored conditional probability, but without much success.

Best Answer

This is a nice question.

You don't specify which R packages and functions you use, so I'll take the excellent vignette for the pscl package and walk through that. Specifically, I'll start from the description of a hurdle model in its Section 3.5.

As described in the vignette, you can download the example data file (DebTrivedi.rda) and save it locally. Then we can load it, extract the columns we want, load the package, and fit a hurdle model:

load("DebTrivedi.rda")
dt <- DebTrivedi[, c(1, 6:8, 13, 15, 18)]
require(pscl)
fm_hurdle0 <- hurdle(ofp ~ ., data = dt, dist = "negbin")
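
Note that hurdle() fits both parts of the model, the truncated count component and the zero hurdle, in a single call, so the fitted object already bundles the two "stages" you fitted separately; the summary shows both sets of coefficients:

summary(fm_hurdle0)   # reports count-model and zero-hurdle coefficients separately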

So, how do we predict from fm_hurdle0? First, we have to specify the scenario we want to predict for. As always with predict functions, this means filling a data.frame with the corresponding regressor values:

newdata <- data.frame(hosp = 1, health = "average", numchron = 2,
    gender = "male", school = 10, privins = "yes")

Now, we can look at the specific predict method for objects of class hurdle like this:

?predict.hurdle

The key argument is type. By default, it is "response", i.e., the conditional mean of the response given the covariates:

predict(fm_hurdle0, newdata=newdata)
       1 
6.682801

Thus, our point prediction for this new instance is 6.68.

However, I'd argue that the point prediction is not everything. After all, we are fitting a hurdle model precisely because the distribution is "nonstandard": we have zero inflation. So we should try to get an idea of the overall shape of the likely response, which means we really want the full predictive density. Happily, we can get that using type="prob":

predictive.density <- predict(fm_hurdle0, type="prob",newdata=newdata)
predictive.density
          0          1         2          3          4          5 ...
1 0.0876322 0.09924771 0.0983676 0.09212974 0.08377478 0.07480663 ...

Let's visualize that, including the predicted mean as a vertical line:

plot(as.numeric(colnames(predictive.density)), predictive.density,
    col = "grey", type = "h", lwd = 3,               # one spike per possible count
    xlab = "Outcome", ylab = "Predicted probability")
abline(v = predict(fm_hurdle0, newdata = newdata))   # predicted mean

[Figure: the predictive density for the new observation, with the predicted mean marked by a vertical line]

As a double check, we can recover the predicted mean from the predictive density (up to numerical accuracy):

sum(as.numeric(colnames(predictive.density))*predictive.density)
[1] 6.682788

You can also have predict.hurdle give you

  • the predicted mean of the count component (without the zero hurdle) with type="count", and
  • the predicted ratio of probabilities for observing a non-zero count with type="zero", which is a somewhat complicated beast; see the help page ?predict.hurdle on this and the small consistency check below.
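
These pieces fit together: the hurdle mean is the count-component mean multiplied by that probability ratio, so, as a quick sketch with the objects from above, the product should reproduce the type="response" prediction (up to rounding):

mu_count   <- predict(fm_hurdle0, newdata = newdata, type = "count")
zero_ratio <- predict(fm_hurdle0, newdata = newdata, type = "zero")
mu_count * zero_ratio   # should agree with predict(fm_hurdle0, newdata = newdata), about 6.68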

If you use a different function (for instance two separate GAMs, as in your two-stage setup), this may give you some ideas but may not be enough. You may then want to post a follow-up question focusing on the R aspects on Stack Overflow under the r tag.
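
That said, if you stay with two separate GAMs, the analogue of the hurdle mean is the law of total expectation: the expected catch equals the probability of a non-zero catch times the conditional mean of the positive catches, so the two predict() calls are simply multiplied; no independence assumption between the stages is needed beyond conditioning on the same covariates. A minimal sketch, assuming mgcv and hypothetical data and variable names (fish, newgrid, temp, chl, present, catch):

require(mgcv)

## hypothetical data set 'fish' with columns present (0/1), catch, temp, chl
## Stage 1: probability of a non-zero catch (presence/absence, binomial GAM)
m_bin <- gam(present ~ s(temp) + s(chl), family = binomial, data = fish)
## Stage 2: mean catch given that something was caught (positive records only)
m_pos <- gam(catch ~ s(temp) + s(chl), family = Gamma(link = "log"),
    data = subset(fish, catch > 0))

## Expected catch = P(catch > 0) * E(catch | catch > 0)
p_occ  <- predict(m_bin, newdata = newgrid, type = "response")
mu_pos <- predict(m_pos, newdata = newgrid, type = "response")
expected_catch <- p_occ * mu_pos

The key point is that the second-stage prediction must be the mean conditional on a positive catch, which it is when that model is fitted to the positive records only.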

Finally, as I wrote, I strongly recommend looking not only at the predicted mean, but at the entire predictive density as illustrated above. If you want to check how good your predictive density is and you have some holdout data, you can use discrete proper scoring rules to check the calibration and sharpness of your predictive density. A good place to start is Wei & Held (2014), "Calibration Tests for Count Data", TEST, 23, 787-805.
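
As a rough sketch of how such a check could look in practice, here is an average logarithmic score computed on a random holdout split of the data from above (the split, the choice of score, and the use of predict.hurdle's at argument to set the grid of counts are my own illustrative assumptions, not taken from the paper):

## Sketch: average logarithmic score on holdout data (smaller is better)
set.seed(1)
test_idx <- sample(nrow(dt), 500)
fit_cv <- hurdle(ofp ~ ., data = dt[-test_idx, ], dist = "negbin")

## predictive probabilities for every holdout case at the counts 0, 1, 2, ...
pp <- predict(fit_cv, newdata = dt[test_idx, ], type = "prob", at = 0:max(dt$ofp))
## probability each prediction assigned to the count actually observed
obs_prob <- pp[cbind(seq_len(nrow(pp)), dt$ofp[test_idx] + 1)]
mean(-log(obs_prob))   # compare this value across candidate models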
