Solved – Model validation after fitting a negative binomial GLM in R

negative-binomial-distributionrvalidation

Ok, I have searched and searched and just have no clue where to start. First, what I would like to do is produce a QQ-plot (or even a readable residual plot) to look at the fit of my model. I guess I just don't understand how the parameters that go into qnbinom() are obtained from the output of MASS::glm.nb(). I am attempting to use probplot() from package e1071, but am unsure of the inputs needed. It would be great if someone having experience with fitting negative binomials could lend a hand.

Secondly, I came across a residual plot here: http://www.stat.cmu.edu/~hseltman/Rclass/R8.R
I can make it to work, but I don't know how to interpret it or if I am using it correctly. Has anyone else used this?

At the moment I am relying on the AIC and plots of fitted vs. actual values to assess the fit of my model and I would like something a little better!

Edit:
Hopefully this will clarify what I am asking. With qnbinom(p, size, prob, mu, lower.tail = TRUE, log.p = FALSE), how do I (or is it even possible to) get the p, size, prob, mu from the output of a glm.nb fit model? From my research, I have found that size is the dispersion parameter, but other than that I'm not sure where to go. I know theta goes in there somehow, just not sure how to get it in the form needed.

Edit 2: Ok, once I have a distplot(), is there a guide to interpreting it? I am fairly positive I have a bad fit because I have a curved plot with a red line going through it (with many points on the tails far from the red line). The prob: ML = 0.011, is this rejecting that the distribution is from the negative binomial specified?

Best Answer

You might find distplot() from the vcd package useful either for the original data (edit: you can't use it on residuals). This plots Friendly's "negativebinomialness plots" and provides how well the negative binomial model fits
distplot(response, type = "nbinomial", ...)

To obtain the parameters: glm.nb uses the "Gamma mixture of Poisson" representation. It is actually a log-linear model that is fitted, so you should get the mean as $\exp(X\beta)$.

For example, let's say your data come from a negbin with mean 5 and theta of 1 (in the alternative representation as described above). Then you can get the mean estimate simply by

set.seed(10)  
df <- data.frame(y=rnbinom(100,size=1,mu=5))  
m0 <- glm.nb(y~1,data=df)  
m0  
exp(coef(m0))  
m0$theta  

which are in this case 5.1 for the mean (pretty close) and 1.6 for the dispersion parameter (pretty far off).

If you fit a model for the conditional mode, you interpret it accordingly as in every other log linear model, see this discussion on stack exchange.

EDIT: If you want to know how to get the mean in a negbin regression model you need to sum up the linear predictor $X\beta$.

For example: I take the quine data and fit

m1 <- glm.nb(Days~Sex,data=quine)

now males are 1 females are 0. To get the mean for males you write

> exp(coef(m1)[1]+coef(m1)[2]*1)  
[1] 17.95455    

and for females

> exp(coef(m1)[1]+coef(m1)[2]*0)     
[1] 15.225  

Now to get the mean you must weight this with the occurence of all females and males which is

> table(quine$Sex)  
 F  M   
80 66  

and hence the mean is

> (80/(66+80))*15.225+(66/(80+66))*17.95455  
[1] 16.45685  

This is confirmed by

> nb0 <- glm.nb(Days ~ 1, data = quine)    
> exp(coef(nb0))  
(Intercept)  
[1] 16.4589

(apart from rounding errors).

Related Question