Here is what I usually like doing (for illustration I use the overdispersed and not easily modelled quine data on pupils' days absent from school, from MASS):
Test and graph the original count data by plotting observed and fitted frequencies (see chapter 2 in Friendly), which is supported in large part by the vcd package in R. For example, with goodfit() and a rootogram:
library(MASS)
library(vcd)
data(quine)
fit <- goodfit(quine$Days)
summary(fit)
rootogram(fit)
or with Ord plots, which help in identifying the underlying count data model (e.g., here both the slope and the intercept are positive, which points to a negative binomial distribution):
Ord_plot(quine$Days)
or with the "XXXXXXness" plots where XXXXX is the distribution of choice, say Poissoness plot (which speaks against Poisson, try also type="nbinom"
):
distplot(quine$Days, type="poisson")
Inspect usual goodness-of-fit measures (such as likelihood ratio statistics vs. a null model or similar):
mod1 <- glm(Days~Age+Sex, data=quine, family="poisson")
summary(mod1)
anova(mod1, test="Chisq")
Check for over-/underdispersion by looking at residual deviance/df or at a formal test statistic (e.g., see this answer). Here we clearly have overdispersion:
library(AER)
deviance(mod1)/mod1$df.residual
dispersiontest(mod1)
Check for influential and leverage points, e.g., with influencePlot() from the car package. Of course many points here are highly influential because Poisson is a bad model:
library(car)
influencePlot(mod1)
Check for zero inflation by fitting a count data model and its zero-inflated / hurdle counterpart and comparing them (usually with AIC). Here a zero-inflated model fits better than the simple Poisson (again probably due to the overdispersion):
library(pscl)
mod2 <- zeroinfl(Days~Age+Sex, data=quine, dist="poisson")
AIC(mod1, mod2)
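The hurdle counterpart mentioned above can be fit and compared in the same way (a sketch; hurdle() is also provided by pscl):

```r
# Sketch: hurdle counterpart of the Poisson model, compared by AIC
mod3 <- hurdle(Days ~ Age + Sex, data = quine, dist = "poisson")
AIC(mod1, mod2, mod3)
```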
Plot the residuals (raw, deviance, or scaled) on the y-axis vs. the (log) predicted values (or the linear predictor) on the x-axis. Here we see some very large residuals and a substantial departure of the deviance residuals from normality (speaking against the Poisson; Edit: @FlorianHartig's answer suggests that normality of these residuals is not to be expected, so this is not a conclusive clue):
res <- residuals(mod1, type="deviance")
plot(log(predict(mod1)), res)
abline(h=0, lty=2)
qqnorm(res)
qqline(res)
If interested, plot a half-normal probability plot of residuals by plotting ordered absolute residuals vs. expected normal values, as proposed by Atkinson (1981). A special feature would be to simulate a reference 'line' and an envelope with simulated / bootstrapped confidence intervals (not shown though):
library(faraway)
halfnorm(residuals(mod1))
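As an aside, the simulated-envelope idea mentioned above is automated by the hnp package (not part of the original analysis; shown only as a pointer):

```r
# Half-normal plot with a simulated envelope (hnp package;
# an alternative to faraway::halfnorm with the envelope added)
library(hnp)
hnp(mod1)
```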
Diagnostic plots for log-linear models for count data (see chapters 7.2 and 7.7 in Friendly's book). Plot predicted vs. observed values, perhaps with some interval estimate (I did so just for the age groups). Here we see again that our estimates are pretty far off due to the overdispersion, apart, perhaps, from group F3. The pink points are the point prediction $\pm$ one standard error:
plot(Days~Age, data=quine)
prs <- predict(mod1, type="response", se.fit=TRUE)
pris <- data.frame("pest"=prs[[1]], "lwr"=prs[[1]]-prs[[2]], "upr"=prs[[1]]+prs[[2]])
points(pris$pest ~ quine$Age, col="red")
points(pris$lwr ~ quine$Age, col="pink", pch=19)
points(pris$upr ~ quine$Age, col="pink", pch=19)
This should give you much of the useful information about your analysis, and most steps work for all standard count data distributions (e.g., Poisson, negative binomial, COM-Poisson, power laws).
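Since several diagnostics above (the Ord plot, the dispersion test) point to a negative binomial, a natural follow-up, sketched here, is to refit with glm.nb from MASS and compare:

```r
# Sketch: negative binomial fit, as suggested by the diagnostics above
library(MASS)
mod.nb <- glm.nb(Days ~ Age + Sex, data = quine)
AIC(mod1, mod.nb)  # compare against the Poisson fit
```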
Best Answer
Check out the DHARMa package in R. It uses a simulation-based approach with quantile residuals to generate the type of residuals you may be interested in, and it works with glm.nb from MASS. The essential idea is explained here and goes in three steps: simulate new response data from the fitted model, compute the empirical cdf of the simulated values at each observed value, and use these quantiles as the residuals.
If there is no systematic misspecification in your model, all values from the empirical cdf are equally likely, and a histogram of these residuals should be uniformly distributed between 0 and 1.
The package contains additional checks.
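In code, the basic DHARMa workflow might look like this (a minimal sketch, reusing the Poisson model mod1 fit earlier; function names are from the DHARMa package):

```r
library(DHARMa)
sim <- simulateResiduals(fittedModel = mod1, n = 250)
plot(sim)               # QQ plot of the scaled (quantile) residuals plus residuals vs. predictions
testDispersion(sim)     # simulation-based dispersion test
testZeroInflation(sim)  # simulation-based zero-inflation test
```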
Edit:
The steps above are not exactly correct. The biggest difference between my description and what DHARMa does is that DHARMa uses the simulate() function in base R, which ignores the variability in the estimated regression coefficients; the Gelman and Hill regression text recommends taking that variability into account. A crucial step I forgot to include: once one has generated predictions, we should place them on the response scale. For example, the predicted values from logistic regression are logits, so one should transform them to the probability scale. The next step is to randomly generate observed scores using the predicted probabilities; continuing with the logistic regression example, one can use rbinom() to generate 0-1 outcomes given the predicted probabilities. Additionally, when the responses are integers, such as with binary outcomes or count data models, DHARMa adds random noise to the simulated responses prior to computing the empirical cdf. This step ensures the residuals behave as one would expect under OLS if the model were not misspecified; without it, you could have a pile-up of residuals at 1 if your outcome is binary.
The code for the simulate residuals function in DHARMa is relatively easy to follow for anyone looking to roll their own implementation.
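The randomized-noise idea described above can be sketched by hand (assuming the Poisson model mod1 from the first answer; this mimics, but does not reproduce, DHARMa's internals):

```r
# Hand-rolled randomized quantile residuals (a sketch, not DHARMa's exact code)
set.seed(1)
nsim <- 250
sims <- simulate(mod1, nsim = nsim)  # simulated counts, already on the response scale
u <- sapply(seq_len(nrow(quine)), function(i) {
  s <- as.numeric(sims[i, ])
  # randomized PIT: jitter ties with uniform noise, since counts are integers
  (sum(s < quine$Days[i]) + runif(1) * (sum(s == quine$Days[i]) + 1)) / (nsim + 1)
})
hist(u, breaks = 20)  # for this misspecified Poisson, expect departures from uniformity
```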