Why isn’t it ‘wrong’ to use a log link instead of a logit one when doing GLM with a binomial family?

Tags: binomial distribution, epidemiology, generalized linear model, logistic, regression

I am taking a basic biostats class for an epidemiology masters and we were recently told that log-binomial GLM is what we should be using instead of logistic regression because the coefficients are interpretable in terms of probability ratios (risk/prevalence).

Now, what gets me is that this just seems like we are buying into a larger problem out of laziness: a logit model can still estimate probabilities, so it should allow the corresponding ratios to be extracted via the right manipulations. On the other hand, choosing a log link amounts to admitting probabilities larger than one, and that seems like it would be an issue. I understand that for small $p$ the results will be very similar, but it seems unnecessary when an adequate method already exists.

Surely there is something I am missing here?

Best Answer

The issue that the linear predictor can take the parameter outside its admissible range is real, but it is not limited to this case (common examples are seen when using the identity link with Poisson or Gamma GLMs). If the data stay away from the problem area, that shouldn't necessarily pose any real difficulty.
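As a rough illustration of the range issue, the sketch below applies the inverse log link and the inverse logit link to the same linear predictor. The coefficients `b0` and `b1` and the covariate values are made up for illustration, not estimates from any data.

```python
import math

def inv_log_link(eta):
    """Inverse of the log link: mean = exp(eta). Not bounded above by 1."""
    return math.exp(eta)

def inv_logit_link(eta):
    """Inverse of the logit link: mean = 1 / (1 + exp(-eta)). Always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical intercept and slope, chosen only to show the effect.
b0, b1 = -2.0, 0.5
xs = [0, 2, 4, 6]

log_fitted = [inv_log_link(b0 + b1 * x) for x in xs]
logit_fitted = [inv_logit_link(b0 + b1 * x) for x in xs]

for x, p_log, p_logit in zip(xs, log_fitted, logit_fitted):
    flag = "  <-- not a valid probability" if p_log > 1 else ""
    print(f"x={x}: log link {p_log:.3f}, logit link {p_logit:.3f}{flag}")
```

With these made-up values the log-link "probability" stays valid for the smaller covariate values and only exceeds one at the largest, which is the sense in which data that stay away from the problem area cause no trouble.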

However, the two link functions don't correspond exactly, and so they literally fit different models (unless $p$ is very small throughout, in which case there's no real distinction in practice). As such, for particular applications, it's quite possible that one link function is more suitable than another, at least over the range where data are observed.
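One way to see why the two links nearly coincide for small $p$ is that $\text{logit}(p) - \log(p) = -\log(1-p)$, which goes to zero as $p \to 0$. The sketch below checks this numerically and also compares the risk ratio with the odds ratio for a rare and a common outcome. The probabilities used are arbitrary illustrative values.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def risk_ratio(p0, p1):
    """Ratio of probabilities (what a log link models multiplicatively)."""
    return p1 / p0

def odds_ratio(p0, p1):
    """Ratio of odds (what a logit link models multiplicatively)."""
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))

# Gap between the two link functions at several illustrative probabilities.
probs = [0.001, 0.01, 0.1, 0.5]
gaps = [logit(p) - math.log(p) for p in probs]  # equals -log(1 - p)

# Same risk ratio of 2, once for a rare outcome and once for a common one.
rare = (0.01, 0.02)
common = (0.3, 0.6)

for p, g in zip(probs, gaps):
    print(f"p={p}: logit(p) - log(p) = {g:.4f}")
print("rare:   RR =", risk_ratio(*rare), " OR =", round(odds_ratio(*rare), 4))
print("common: RR =", risk_ratio(*common), " OR =", round(odds_ratio(*common), 4))
```

For the rare outcome the odds ratio is close to the risk ratio, while for the common outcome the two differ substantially, so the choice of link matters once $p$ is not small.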

Further, in some cases, ease of interpretation may be more useful than quality of fit; if the log link fits with some theoretical understanding, for example, it may be preferable.

However, you're quite right that it's not at all difficult to convert predictions of $\text{logit}(p)$ into predictions of $p$ or $\log(p)$, so if the main reason is discomfort with the $\text{logit}$ function, that would seem a somewhat poor reason to avoid it. On the other hand, if you were to choose the log link because you expected the conditional expectation of the response to be of that form, or because you wanted to explicitly model the log-mean, then it would make sense to do that.
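The conversion mentioned here can be sketched as follows for a logistic model with a single binary exposure. The coefficients `b0` and `b1` are hypothetical, standing in for whatever a fitted logistic regression would return.

```python
import math

def expit(eta):
    """Inverse logit: converts a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical fitted logistic coefficients: intercept and exposure effect.
b0, b1 = -1.0, 0.7

p_unexposed = expit(b0)       # predicted probability when exposure = 0
p_exposed = expit(b0 + b1)    # predicted probability when exposure = 1

log_p_exposed = math.log(p_exposed)      # prediction on the log(p) scale
risk_ratio = p_exposed / p_unexposed     # probability ratio from the logit model
odds_ratio = math.exp(b1)                # what the coefficient itself gives

# The risk ratio from a logit model depends on the baseline:
# the same b1 with a lower intercept gives a different ratio.
risk_ratio_low_baseline = expit(-4.0 + b1) / expit(-4.0)

print(f"p0={p_unexposed:.4f}, p1={p_exposed:.4f}")
print(f"risk ratio={risk_ratio:.4f}, odds ratio={odds_ratio:.4f}")
print(f"risk ratio at lower baseline={risk_ratio_low_baseline:.4f}")
```

Note that in a logit model the odds ratio is constant across baselines while the derived risk ratio is not; in a log-binomial model it is the other way round. That is one concrete sense in which the two links fit different models.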
