The criterion is based upon (informed) model comparisons. You are trying to account for over-dispersion:
Poisson: var(x) ~ mu
Negative binomial: var(x) > mu
and for "extra" zeros:
ZIP (zero-inflated Poisson): var(x) ~ mu
ZINB (zero-inflated negative binomial): var(x) > mu
One actively maintained package you can use is pscl, which you can install with install.packages("pscl").
You can then fit a number of models such as a hurdle model that uses a negative binomial for the counts and a binomial model for modeling the probability of zeros. This would be written something like:
library(pscl)
fit <- hurdle(Admissions ~ Temperature + Humidity, dist = "negbin", data = data)
summary(fit)
Note that the output will have two sets of coefficients: one for the hurdle component and one for the count component. The output also provides an estimate of the theta (overdispersion) parameter of the negative binomial.
Or you may want to look at the zero-inflation model:
fit1 <- zeroinfl(Admissions ~ Temperature + Humidity, data = data, dist = "negbin", link = "logit")
These models can be examined with AIC (also compare these models to your Poisson model...)
AIC(fit,fit1)
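For the Poisson baseline, a standard GLM will do; a minimal sketch, assuming the same variable names as above:
fit0 <- glm(Admissions ~ Temperature + Humidity, family = poisson, data = data)
AIC(fit0, fit, fit1)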
You shouldn't compare the AICs between objects fitted with different software. gam() is fitted via some fancy code fu in the mgcv package, whereas your gamm() fit is actually fitted via fancy code in the MASS (glmmPQL()) and then nlme (lme()) packages. It would be common for different constants to end up in the log likelihood.
When I read your question, I assumed you wanted to compare a GAM with no random effect to the same model with a random effect(s). To do that, fit the non-random-effect model with gamm() too. For example, using @ACD's example data (from a now-deleted Answer):
library(mgcv)
set.seed(13)
x <- rnorm(1000)
eff <- as.factor(round(rnorm(1000) + 5))
y <- exp(x) * runif(1000) + as.numeric(eff)
plot(x, y)
gam_example  <- gamm(y ~ s(x), method = "REML", family = Gamma(link = "log"))
gamm_example <- gamm(y ~ s(x), method = "REML", random = list(eff = ~ 1),
                     family = Gamma(link = "log"))
Comparing the AICs of the two fits gives:
> AIC(gam_example$lme)
[1] -2.136317
> AIC(gamm_example$lme)
[1] -1286.448
Hence there is strong support for the inclusion of the random effect, and we can compare the AICs because both models have been fitted via the same algorithm and code.
Technically, what @ACD shows (in a now-deleted Answer) is also incorrect. Whilst the two models are comparable in terms of both including a random effect for the eff variable, the AICs are not comparable because very different algorithms are used in the fitting:
## after running the code from @ACD's answer
> AIC(gam_example)
[1] 1721.197
> AIC(gamm_example$lme)
[1] -1286.448
The difference in AIC is meaningless; in this case, the gamm() model is fitted using a penalised quasi-likelihood criterion, whereas the gam() model is fitted using a standard REML criterion.
Best Answer
This is a nice question.
You don't specify what R packages and functions you use, so I'll take the excellent vignette for the pscl package and walk along that. Specifically, I'll jump off the description of a hurdle model in section 3.5. As described in the vignette, you can download the example data file here. Save it locally. Then we can load it, extract the columns we want, load the package and fit a hurdle model:
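A sketch of that step, assuming the vignette's DebTrivedi data saved locally as DebTrivedi.rda (the column selection here is illustrative and may differ from the vignette's):
load("DebTrivedi.rda")                      # the example data file, saved locally
dt <- DebTrivedi[, c("ofp", "hosp", "health", "numchron",
                     "gender", "school", "privins")]
library(pscl)
fm_hurdle0 <- hurdle(ofp ~ ., data = dt, dist = "negbin")
summary(fm_hurdle0)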
So, how do we predict from fm_hurdle0? Of course, we first have to specify what we want to predict for. As always for predict functions, this means filling a data.frame with regressor information for the scenario we want to predict for:
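For instance (the covariate values here are purely illustrative, not necessarily those of the original scenario):
newdata <- data.frame(hosp = 1, health = "average", numchron = 2,
                      gender = "male", school = 10, privins = "yes")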
Now, we can look at the specific predict method for objects of class hurdle like this:
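One way to do this (a sketch; getS3method pulls the unexported method out of the package):
args(getS3method("predict", "hurdle"))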
The key parameter is type. By default, this is "response", that is, the conditional mean given covariates:
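For example (the 6.68 quoted below comes from the original answer's scenario; a different newdata will of course give a different number):
predict(fm_hurdle0, newdata = newdata, type = "response")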
Thus, our point prediction for this new instance is 6.68.
However, I'd argue that the point prediction is definitely not everything. After all, we are fitting a specific hurdle model because the variance is "nonstandard" - we have zero inflation. So we should definitely try to get an idea of the overall shape of the likely response. This means that we really want the full predictive density. Happily, we can get that using type="prob":
Let's visualize that, including the predicted mean as a vertical line:
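One possible base-graphics version:
counts <- seq_len(ncol(probs)) - 1          # the support: 0, 1, 2, ...
plot(counts, probs[1, ], type = "h",
     xlab = "Count", ylab = "Predicted probability")
abline(v = predict(fm_hurdle0, newdata = newdata, type = "response"), lty = 2)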
For double-checking, we can recover the predicted mean from the predictive density (up to numerical accuracy):
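For instance:
sum(counts * probs[1, ])                    # should match the type = "response" prediction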
You can also have predict.hurdle give you type="count" and type="zero", which is a somewhat complicated beast - see the help page ?predict.hurdle on this.
If you use a different function, this may give you some ideas but may not be enough. You may then want to post a follow-up question focusing on the R aspects at StackOverflow in the R tag.
Finally, as I wrote, I strongly recommend looking not only at the predicted mean, but at the entire predictive density as illustrated above. If you want to check how good your predictive density is and you have some holdout data, you can use discrete proper scoring rules to check the calibration and sharpness of your predictive density. A good place to start is Wei & Held (2014), "Calibration Tests for Count Data", TEST, 23, 787-805.
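As a rough sketch of what such a check could look like (the log score below is just one example of a discrete proper scoring rule, and y_new is a hypothetical holdout value):
probs_new <- predict(fm_hurdle0, newdata = newdata, type = "prob")
y_new <- 4                                  # hypothetical holdout observation
-log(probs_new[1, y_new + 1])               # log score: smaller is better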