Solved – Which diagnostics can validate the use of a particular family of GLM

gamma distribution, generalized linear model, stata

This seems so elementary, but I always get stuck at this point…

Most of the data I deal with are non-normal, and most of my analyses are based on a GLM structure. For my current analysis, the response variable is "walking speed" (meters/minute). It's easy for me to see that I cannot use OLS, but I then have great uncertainty in deciding which family (Gamma, Weibull, etc.) is appropriate!

I use Stata and look at diagnostics like residuals and heteroscedasticity, residuals vs. fitted values, etc.

I am aware that count data can take the form of a rate (e.g. incidence rates) and have used gamma (the analog to overdispersed discrete negative binomial models), but I would just like a "smoking gun" to say YES, YOU HAVE THE RIGHT FAMILY. Is looking at the standardized residuals versus the fitted values the only, and best, way to do this? I would like to use a mixed model to account for some hierarchy in the data as well, but first I need to sort out which family best describes my response variable.

Any help appreciated. Stata language especially appreciated!

Best Answer

I have some tips:

(1) How residuals ought to compare to fits isn't always all that obvious, so it's good to be familiar with diagnostics for particular models. In logistic regression models, for example, the Hosmer-Lemeshow statistic is used to assess goodness of fit; leverage values tend to be small where the estimated odds are very large, very small or about even; & so on.

(2) Sometimes one family of models can be seen as a special case of another, so you can use a hypothesis test on a parameter to help you choose. Exponential vs Weibull, for example.
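
As a minimal sketch of tip (2) in R: the exponential is the Weibull with its shape parameter fixed at 1, so a likelihood-ratio test with one degree of freedom compares the two. The simulated data and all object names below are made up for illustration, and the sketch assumes the survival package that ships with standard R installations.

```r
library(survival)

set.seed(1)
# Hypothetical positive response, simulated from a Weibull with shape 2
y <- rweibull(200, shape = 2, scale = 10)

# Intercept-only fits; Surv(y) treats every observation as uncensored
fit_exp <- survreg(Surv(y) ~ 1, dist = "exponential")
fit_wei <- survreg(Surv(y) ~ 1, dist = "weibull")

# Likelihood-ratio statistic for H0: shape = 1 (exponential)
lr      <- 2 * (as.numeric(logLik(fit_wei)) - as.numeric(logLik(fit_exp)))
p_value <- pchisq(lr, df = 1, lower.tail = FALSE)

# survreg reports scale = 1 / shape for the Weibull
shape_hat <- 1 / fit_wei$scale
c(LR = lr, p = p_value, shape = shape_hat)
```

A small p-value here says the extra shape parameter is doing real work, i.e. the exponential is too restrictive for these data.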

(3) Akaike's Information Criterion is useful in choosing between different models, which includes choosing between different families.
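
A rough sketch of tip (3) in R, using simulated data as a stand-in for a positive response such as walking speed (the variable names `speed` and `age` and the data-generating values are assumptions, not taken from the question). All three fits model the same response on the same scale, so their AIC values are comparable; note that for the Gamma and inverse Gaussian families R's `AIC()` plugs in a deviance-based dispersion estimate, so small differences should not be over-interpreted.

```r
set.seed(42)
n   <- 300
age <- runif(n, 20, 80)
mu  <- exp(4.6 - 0.008 * age)                     # hypothetical mean speed
speed <- rgamma(n, shape = 10, rate = 10 / mu)    # simulated, Gamma-distributed
dat <- data.frame(speed, age)

# Same mean structure (log link), three candidate families
fits <- list(
  gaussian  = glm(speed ~ age, family = gaussian(link = "log"),         data = dat),
  gamma     = glm(speed ~ age, family = Gamma(link = "log"),            data = dat),
  inv_gauss = glm(speed ~ age, family = inverse.gaussian(link = "log"), data = dat)
)

# Lower AIC indicates a better trade-off between fit and complexity
aic <- sapply(fits, AIC)
sort(aic)
```

Holding the link and covariates fixed across the candidates means the comparison isolates the choice of family.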

(4) Theoretical/empirical knowledge about what you're modelling narrows the field of plausible models.

But there's no automatic way of finding the 'right' family; real-life data can come from distributions as complicated as you like, & the complexity of models that are worth trying to fit increases with the amount of data you have. This is part & parcel of Box's dictum that no models are true but some are useful.

Re @gung's comment: it appears the commonly used Hosmer-Lemeshow test is (a) surprisingly sensitive to the choice of bins, & (b) generally less powerful than some other tests against some relevant classes of alternative hypothesis. That doesn't detract from point (1): it's also good to be up-to-date.

Related Solutions

Count Regression Analysis – Diagnostic Plots for Count Regression

Here is what I usually like doing (for illustration I use the overdispersed and not very easily modelled quine data of pupils' days absent from school from MASS):

Test and graph the original count data by plotting observed frequencies and fitted frequencies (see chapter 2 in Friendly) which is supported by the vcd package in R in large parts. For example, with goodfit and a rootogram:
```
library(MASS)
library(vcd)
data(quine) 
fit <- goodfit(quine$Days) 
summary(fit) 
rootogram(fit)
```
or with Ord plots which help in identifying which count data model is underlying (e.g., here the slope is positive and the intercept is positive which speaks for a negative binomial distribution):
```
Ord_plot(quine$Days)
```
or with the "XXXXXness" plots, where XXXXX is the distribution of choice, say a Poissonness plot (which here speaks against Poisson; try also type="nbinom"):
```
distplot(quine$Days, type="poisson")
```
Inspect usual goodness-of-fit measures (such as likelihood ratio statistics vs. a null model or similar):
```
mod1 <- glm(Days~Age+Sex, data=quine, family="poisson")
summary(mod1)
anova(mod1, test="Chisq")
```
Check for over- or underdispersion by looking at residual deviance/df or at a formal test statistic (e.g., see this answer). Here we clearly have overdispersion:
```
library(AER)
deviance(mod1)/mod1$df.residual
dispersiontest(mod1)
```
Check for influential and leverage points, e.g., with the influencePlot in the car package. Of course here many points are highly influential because Poisson is a bad model:
```
library(car)
influencePlot(mod1)
```
Check for zero inflation by fitting a count data model and its zero-inflated / hurdle counterpart and comparing them (usually with AIC). Here a zero-inflated model would fit better than the simple Poisson (again probably due to overdispersion):
```
library(pscl)
mod2 <- zeroinfl(Days~Age+Sex, data=quine, dist="poisson")
AIC(mod1, mod2)
```
Plot the residuals (raw, deviance or scaled) on the y-axis vs. the (log) predicted values (or the linear predictor) on the x-axis. Here we see some very large residuals and a substantial deviation of the deviance residuals from the normal (speaking against the Poisson; Edit: @FlorianHartig's answer suggests that normality of these residuals is not to be expected, so this is not a conclusive clue):
```
res <- residuals(mod1, type="deviance")
# predict() defaults to the link scale, which for the log link is already log(fitted mean)
plot(predict(mod1, type="link"), res)
abline(h=0, lty=2)
qqnorm(res)
qqline(res)
```
If interested, plot a half-normal probability plot of residuals by plotting ordered absolute residuals vs. expected normal values (Atkinson, 1981). A special feature would be to simulate a reference 'line' and envelope with simulated / bootstrapped confidence intervals (not shown though):
```
library(faraway)
halfnorm(residuals(mod1))
```
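
The simulated envelope mentioned above could be sketched roughly as follows in base R. This is an illustration under assumptions rather than the answerer's own code: it refits the same Poisson model to 99 data sets simulated from the fitted model and uses pointwise 2.5% and 97.5% quantiles of the ordered absolute deviance residuals as the band.

```r
library(MASS)   # quine data

mod1 <- glm(Days ~ Age + Sex, data = quine, family = poisson)

set.seed(123)
nsim <- 99
obs  <- sort(abs(residuals(mod1, type = "deviance")))
n    <- length(obs)

# Simulate responses from the fitted model, refit, collect ordered |residuals|
sims <- simulate(mod1, nsim = nsim)
env  <- sapply(sims, function(ysim) {
  refit <- glm(ysim ~ Age + Sex, data = quine, family = poisson)
  sort(abs(residuals(refit, type = "deviance")))
})

# Expected half-normal order statistics (Atkinson's approximation)
q   <- qnorm((seq_len(n) + n - 1/8) / (2 * n + 1/2))
lo  <- apply(env, 1, quantile, probs = 0.025)
hi  <- apply(env, 1, quantile, probs = 0.975)
med <- apply(env, 1, median)

plot(q, obs, xlab = "Half-normal quantiles",
     ylab = "Ordered |deviance residuals|", ylim = range(obs, hi))
lines(q, med)            # simulated reference 'line'
lines(q, lo, lty = 2)    # lower envelope
lines(q, hi, lty = 2)    # upper envelope

# Share of observed points falling outside the envelope
outside <- mean(obs < lo | obs > hi)
```

If the assumed family were adequate, most observed points would lie inside the band; a large share outside is evidence against it.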
Diagnostic plots for log-linear models for count data (see chapters 7.2 and 7.7 in Friendly's book). Plot predicted vs. observed values, perhaps with some interval estimate (I did this just for the age groups; here we see again that we are pretty far off with our estimates due to the overdispersion, except perhaps in group F3. The pink points are the point prediction $\pm$ one standard error):
```
plot(Days~Age, data=quine) 
prs  <- predict(mod1, type="response", se.fit=TRUE)
pris <- data.frame("pest"=prs[[1]], "lwr"=prs[[1]]-prs[[2]], "upr"=prs[[1]]+prs[[2]])
points(pris$pest ~ quine$Age, col="red")
points(pris$lwr  ~ quine$Age, col="pink", pch=19)
points(pris$upr  ~ quine$Age, col="pink", pch=19)
```

This should give you much of the useful information about your analysis and most steps work for all standard count data distributions (e.g., Poisson, Negative Binomial, COM Poisson, Power Laws).

Solved – glm model fit – can’t find a family/link combination that produces good fit

Convergence problems aside, in any glm, you have two issues:

1. How does your mean $E(Y)$ depend on covariates $X$? This is what should determine your link function.
2. How does your variance $V(Y)$ depend on $E(Y)$? This is what should determine your family.
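
One informal way to probe the variance question is a Park-type check: regress the squared residuals on the log of the fitted mean with a log link, and read the slope as the power $p$ in $V(Y) \propto E(Y)^p$. As a rough guide, a slope near 0 points to Gaussian, near 1 to Poisson-like, near 2 to Gamma and near 3 to inverse Gaussian. The sketch below uses simulated Gamma data with made-up names and should be treated as a heuristic, not a formal test.

```r
set.seed(7)
n  <- 2000
x  <- runif(n, 0, 2)
mu <- exp(1 + 1.5 * x)
y  <- rgamma(n, shape = 5, rate = 5 / mu)   # simulated: Var(Y) = mu^2 / 5

# Step 1: fit the mean model with a provisional family
fit   <- glm(y ~ x, family = Gamma(link = "log"))
muhat <- fitted(fit)

# Step 2: model squared raw residuals as a power of the fitted mean
r2   <- (y - muhat)^2
park <- glm(r2 ~ log(muhat), family = Gamma(link = "log"))

# Estimated power p in Var(Y) proportional to E(Y)^p
slope <- unname(coef(park)[2])
slope
```

Because the data were simulated from a Gamma, the estimated slope should land close to 2; on real data the estimate only narrows the field of plausible families.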

Point 1 is more important than point 2, because you can fix 2 by using robust variance (sandwich estimates) or bootstrapping. If you get 1 wrong, then your coefficient estimates become hard to interpret.
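
To illustrate the robust-variance fix, here is a hand-rolled sandwich (HC0-type) estimate in base R for a Poisson fit to the quine data from MASS. It is a sketch for exposition; in practice a dedicated implementation such as the sandwich package would normally be used.

```r
library(MASS)   # quine data

mod1 <- glm(Days ~ Age + Sex, data = quine, family = poisson)

X <- model.matrix(mod1)
w <- weights(mod1, type = "working")
# Per-observation score factor: working residual times working weight
u <- residuals(mod1, type = "working") * w

bread <- solve(crossprod(X, X * w))   # (X' W X)^{-1}
meat  <- crossprod(X * u)             # sum of u_i^2 * x_i x_i'
vc_robust <- bread %*% meat %*% bread

se_naive  <- sqrt(diag(vcov(mod1)))   # model-based (assumes Var = mean)
se_robust <- sqrt(diag(vc_robust))    # valid even if the variance is misspecified
cbind(naive = se_naive, robust = se_robust)
```

The coefficient estimates are unchanged; only the standard errors differ, which is why a misspecified family is the more forgivable of the two mistakes.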

How about a quasi-Poisson model, with an identity link, and perhaps make your model a bit more flexible by including a regression spline for your prison length "score"?
