Solved – Assumptions of generalized linear models

Tags: generalized-linear-model, logistic

On page 232 of "An R Companion to Applied Regression", Fox and Weisberg note:

Only the Gaussian family has constant variance, and in all other GLMs the conditional variance of $y$ at $\mathbf{x}$ depends on $\mu(\mathbf{x})$.

Earlier, they note that the conditional variance of the Poisson is $\mu$ and that of the binomial is $\frac{\mu(1-\mu)}{N}$.

For the Gaussian, this is a familiar and often-checked assumption (homoscedasticity). Similarly, I often see the conditional variance of the Poisson discussed as an assumption of Poisson regression, together with remedies for cases where it is violated (e.g. negative binomial or zero-inflated models). Yet I never see the conditional variance of the binomial discussed as an assumption in logistic regression. A little Googling did not turn up any mention of it.
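To make the binomial variance function concrete, here is a minimal simulation sketch (all values chosen for illustration) checking the empirical conditional variance of a proportion against $\frac{\mu(1-\mu)}{N}$:

```r
# Sketch: empirical check that Var(y | x) = mu*(1 - mu)/N for a binomial proportion
set.seed(1)
N  <- 20                                        # trials per observation
mu <- 0.3                                       # true success probability at a fixed x
y  <- rbinom(10000, size = N, prob = mu) / N    # observed proportions

var(y)              # empirical conditional variance
mu * (1 - mu) / N   # theoretical variance implied by the binomial family
```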

What am I missing here?

EDIT, subsequent to @whuber's comment:

As suggested, I am looking through Hosmer & Lemeshow. It is interesting, and I think it shows why I (and perhaps others) find this confusing. For example, the word "assumption" does not appear in the book's index. In addition, we have this (p. 175):

In logistic regression we have to rely primarily on visual assessment, as the distribution of the diagnostics under the hypothesis that the model fits is known only in certain limited settings.

They show quite a few plots, but concentrate on scatterplots of various residuals vs. the estimated probability. These plots (even for a good model) do not have the "blobby" pattern characteristic of similar plots in OLS regression, and so are harder to judge. Further, they show nothing akin to quantile plots.

In R, `plot.lm` offers a nice default set of plots for assessing models; I do not know of an equivalent for logistic regression, although one may exist in some package. This may be because different plots would be needed for each type of model. SAS does offer some plots in PROC LOGISTIC.
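For illustration, a minimal sketch of what is available in base R, plus one packaged alternative; the data and model are invented, and `binnedplot()` from the arm package is one option I am aware of for a more readable residual plot (assuming the package is installed):

```r
# Hypothetical simulated logistic-regression example
set.seed(1)
x   <- rnorm(500)
y   <- rbinom(500, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

# plot.lm-style diagnostics do work on glm objects, but the discrete
# outcome makes the patterns hard to read for binary data:
plot(fit)

# A binned residual plot (arm package) averages residuals within bins
# of the fitted probabilities, giving something closer to the OLS "blob":
library(arm)
binnedplot(fitted(fit), residuals(fit, type = "response"))
```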

This certainly seems to be an area of potential confusion!

Best Answer

These plots (even for a good model) do not have the "blobby" pattern characteristic of similar plots in OLS regression, and so are harder to judge. Further, they show nothing akin to quantile plots.

The DHARMa R package solves this problem by simulating from the fitted model to transform the residuals of any GL(M)M into a standardized space. Once this is done, all the usual methods for visually and formally assessing residual problems (e.g. QQ plots, overdispersion, heteroskedasticity, autocorrelation) can be applied. See the package vignette for worked examples.
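A minimal usage sketch (the model and data here are invented for illustration; the vignette shows the canonical workflow):

```r
library(DHARMa)

# Hypothetical fitted binomial GLM
set.seed(1)
x   <- rnorm(500)
y   <- rbinom(500, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

# Simulate scaled (quantile) residuals from the fitted model
res <- simulateResiduals(fittedModel = fit)

# Standard visual checks: QQ plot plus residuals vs. predicted
plot(res)

# Formal tests on the standardized residuals
testDispersion(res)   # over/underdispersion
testUniformity(res)   # departure from uniformity (QQ)
```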

Regarding the comment of @Otto_K: if homogeneous overdispersion is the only problem, it is probably simpler to use an observation-level random effect, which can be implemented with a standard binomial GLMM. However, I think @PeterFlom was also concerned about heteroskedasticity, i.e. a change of the dispersion parameter with some predictor or with the model predictions. This will not be picked up or corrected by standard overdispersion checks and corrections, but you can see it in DHARMa residual plots. For correcting it, modelling the dispersion as a function of something else in JAGS or Stan is probably the only way at the moment.
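For completeness, a sketch of the observation-level random effect mentioned above, using lme4 (data and formula hypothetical):

```r
library(lme4)

# Hypothetical binomial-counts data: successes out of N trials per row
set.seed(1)
d <- data.frame(x = rnorm(200), N = 20)
d$y   <- rbinom(200, size = d$N, prob = plogis(-0.5 + d$x))
d$obs <- factor(seq_len(nrow(d)))   # one random-effect level per observation

# The observation-level random intercept absorbs homogeneous overdispersion
fit <- glmer(cbind(y, N - y) ~ x + (1 | obs), family = binomial, data = d)
summary(fit)
```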