GLM: verifying a choice of distribution and link function

generalized linear model, link-function, modeling, regression

I have a generalized linear model that adopts a Gaussian distribution and log link function. After fitting the model, I check the residuals: QQ plot, residuals vs predicted values, histogram of residuals (acknowledging that due caution is needed). Everything looks good. This seems to suggest (to me) that the choice of a Gaussian distribution was quite reasonable. Or, at least, that the residuals are consistent with the distribution I used in my model.

Q1: Would it be going too far to state that it validates my choice of distribution?

I chose a log link function because my response variable is always positive, but I'd like some sort of confirmation that it was a good choice.

Q2: Are there any tests, like checking the residuals for the choice of distribution, that can support my choice of link function? (Choosing a link function seems a bit arbitrary to me, as the only guidelines I can find are quite vague and hand-wavy, presumably for good reason.)

Best Answer

  1. This is a variant of the frequently asked question of whether you can assert the null hypothesis. In your case, the null would be that the residuals are Gaussian, and visual inspection of your plots (QQ plots, histograms, etc.) constitutes the 'test'. (For a general overview of the issue of asserting the null, it may help to read my answer here: Why do statisticians say a non-significant result means “you can't reject the null” as opposed to accepting the null hypothesis?) In your specific case, you can say that the plots show your residuals are consistent with your assumption of normality, but they don't "validate" the assumption: failing to see a departure from normality is not evidence that none exists.

  2. You can fit your model using different link functions and compare them, but there isn't a test of a single link function in isolation (this is evidently incorrect, see @Glen_b's answer). In my answer to Difference between logit and probit models (which may be worth reading, although it isn't quite the same), I argue that link functions should be chosen based on:

    1. Knowledge of the response distribution,
    2. Theoretical considerations, and
    3. Empirical fit to the data.

    Within that framework, the canonical link for a Gaussian model would be the identity link. In this case you rejected that possibility, presumably for theoretical reasons. I suspect your thinking was that $Y$ cannot take negative values (note that 'cannot' is not the same thing as 'does not happen to'). If so, the log is a reasonable choice a priori, but it doesn't just prevent the fitted mean of $Y$ from becoming negative, it also imposes a specific shape on the curvilinear relationship. A standard plot of residuals vs. fitted values (perhaps with a loess fit overlaid) will help you identify whether the intrinsic curvature in your data is a reasonable match for the specific curvature imposed by the log link. As I mentioned, you can also try any other link that meets your theoretical criteria and compare the two fits directly.
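
To make the comparison in point 2 concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the data are simulated, the helper names (`weighted_line_fit`, `fit_gaussian_glm`, `deviance`, `binned_mean_residuals`) are made up, and the model has a single predictor so that the weighted least squares step has a closed form. The binned mean residuals are only a crude stand-in for a loess curve over a residuals vs. fitted plot. In practice you would let the GLM routine of your statistics package do the fitting.

```python
import math
import random


def weighted_line_fit(x, z, w):
    """Weighted least squares for z ~ b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    zbar = sum(wi * zi for wi, zi in zip(w, z)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxz = sum(wi * (xi - xbar) * (zi - zbar) for wi, xi, zi in zip(w, x, z))
    b1 = sxz / sxx
    return zbar - b1 * xbar, b1


def fit_gaussian_glm(x, y, link, n_iter=50):
    """One-predictor Gaussian GLM; returns (b0, b1, fitted means)."""
    if link == "identity":
        # Identity link with Gaussian errors is ordinary least squares.
        b0, b1 = weighted_line_fit(x, y, [1.0] * len(x))
        return b0, b1, [b0 + b1 * xi for xi in x]
    if link != "log":
        raise ValueError("link must be 'identity' or 'log'")
    # Log link via IRLS, started at mu = y (so every y must be positive).
    mu = list(y)
    for _ in range(n_iter):
        eta = [math.log(m) for m in mu]
        # Working response: eta + (y - mu) * d(eta)/d(mu), with d(eta)/d(mu) = 1/mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        # IRLS weights: (d(mu)/d(eta))^2 / Var, with constant variance.
        w = [m * m for m in mu]
        b0, b1 = weighted_line_fit(x, z, w)
        mu = [math.exp(b0 + b1 * xi) for xi in x]
    return b0, b1, mu


def deviance(y, fitted):
    """Gaussian deviance, i.e. the residual sum of squares."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def binned_mean_residuals(y, fitted, n_bins=5):
    """Mean residual in equal-sized bins ordered by fitted value.

    Any leftover points beyond n_bins * (len(y) // n_bins) are dropped.
    """
    pairs = sorted(zip(fitted, y))
    size = len(pairs) // n_bins
    means = []
    for k in range(n_bins):
        chunk = pairs[k * size:(k + 1) * size]
        means.append(sum(yi - fi for fi, yi in chunk) / len(chunk))
    return means


# Simulated data (an assumption for illustration): E[Y] = exp(1.0 + 0.6 * x).
rng = random.Random(0)
n = 200
x = [3.0 * i / (n - 1) for i in range(n)]
y = [math.exp(1.0 + 0.6 * xi) + rng.gauss(0.0, 0.5) for xi in x]

b0_log, b1_log, fit_log = fit_gaussian_glm(x, y, "log")
b0_id, b1_id, fit_id = fit_gaussian_glm(x, y, "identity")

dev_log = deviance(y, fit_log)
dev_id = deviance(y, fit_id)
bins_log = binned_mean_residuals(y, fit_log)
bins_id = binned_mean_residuals(y, fit_id)

print("deviance  log link: %.1f   identity link: %.1f" % (dev_log, dev_id))
print("binned mean residuals, log link:     ", [round(m, 2) for m in bins_log])
print("binned mean residuals, identity link:", [round(m, 2) for m in bins_id])
```

Both fits use the same number of parameters, so their deviances can be compared directly. A systematic bow in the binned mean residuals (positive at the ends, negative in the middle, or the reverse) under one link but not the other suggests that the curvature imposed by that link does not match the data. As with the residual checks for the distribution, a clean pattern is consistent with the chosen link rather than proof of it.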
