Generalized Linear Model – Comprehensive Understanding of GLMs


I have scoured around, reading posts on Cross Validated (e.g. "Difference between logit and probit models") and looking at references including Dobson and McCullagh & Nelder (see http://www.statsci.org/glm/books.html), so I am aware that this topic is well trodden. Nevertheless, I am trying to articulate and formalize my understanding of GLMs, and while several posts have helped me with that, I am conscious of gaps and of the possibility that my understanding rests on an unsound foundation.

In simple linear regression we have a set of observation pairs $(x_i, y_i)$ and treat $y_i$ as a realization of a random variable $Y_i$ distributed as $Y_i \sim N(\mu_i, \sigma^2)$. The means ($\mu_i$) depend on the predictor but the variance is constant. We model $\mu_i = \beta_0 + (\beta_1 \times x_i)$ (or the matrix equivalent), which I believe is the same as saying $Y_i = \mu_i + \epsilon_i$ with $\epsilon_i \sim N(0,\sigma^2)$. I am not sure, but I think a correct way to state $\mu_i$ is $\mu_i = E[Y_i \mid X_i = x_i]$. Can someone confirm that?
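For completeness, the matrix form alluded to above is, as I understand it,

$$\mathbf{Y} = X\boldsymbol\beta + \boldsymbol\epsilon, \qquad \boldsymbol\epsilon \sim N(\mathbf{0}, \sigma^2 I), \qquad E[\mathbf{Y} \mid X] = X\boldsymbol\beta.$$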

Anyway, we might transform the response to achieve linearity (perhaps by taking logs), in which case we are modelling $\log(Y_i) = \alpha_0 + (\alpha_1 \times x_i) + \epsilon_i$, where $\epsilon_i$ is still assumed normal on the log scale (so that $Y_i$ itself is log-normally distributed).

We generalize by decomposing the model into:

  1. A structural component ($\beta_0 + (\beta_1 \times x_i)$)
  2. A link function $g(\cdot)$
  3. A response distribution, or random component, from the exponential family (Gaussian, binomial, gamma, etc.); a code sketch tying the three pieces together follows this list
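To make the decomposition concrete, here is a minimal sketch of how the three components map onto a model-fitting call. It assumes a recent version of the Python statsmodels package and uses simulated data; the variable names and the particular family/link choice (gamma with a log link) are just for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: one predictor x and a positive, right-skewed response y.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
mu = np.exp(0.5 + 1.2 * x)                  # true mean on the original scale
y = rng.gamma(shape=2.0, scale=mu / 2.0)    # gamma response with mean mu

X = sm.add_constant(x)                      # 1. structural component: beta_0 + beta_1 * x_i

model = sm.GLM(
    y,
    X,
    family=sm.families.Gamma(               # 3. random component (response distribution)
        link=sm.families.links.Log()        # 2. link function g(.)
    ),
)
print(model.fit().summary())
```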

Let's say that the observations are assumed to come from a distribution in the exponential family, and to keep things simple assume that it is the Gaussian distribution. Again, the expected value comes out as $E[Y_i \mid X_i = x_i] = \mu_i$, but it does so from the first derivative of the $b(\theta)$ term in the exponential-family form (McCullagh and Nelder, p. 29; http://www.amazon.com/Generalized-Edition-Monographs-Statistics-Probability/dp/0412317605). In the case where we assume a binomial distribution, the expected value comes out as $np$, but it still comes from the first derivative of the $b(\theta)$ term when the distribution is expressed in the exponential-family form. Note, I don't fully appreciate or understand the derivation of the mean or variance at this stage; perhaps someone could provide a layman's explanation?
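For reference, the exponential-family form I have in mind (following McCullagh and Nelder) is

$$f(y; \theta, \phi) = \exp\!\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}, \qquad E[Y] = b'(\theta), \qquad \operatorname{Var}(Y) = a(\phi)\, b''(\theta).$$

In the Gaussian case $\theta = \mu$, $b(\theta) = \theta^2/2$ and $a(\phi) = \sigma^2$, so $b'(\theta) = \theta = \mu$; for the binomial (number of successes out of $n$), $\theta = \log\{p/(1-p)\}$ and $b(\theta) = n\log(1 + e^{\theta})$, so $b'(\theta) = np$.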

Instead of modelling the mean, as was done for simple linear regression, we now model a transformation of the mean; so instead of saying $\mu_i = \beta_0 + (\beta_1 \times x_i)$ we are saying $g(\mu_i) = \eta_i = \beta_0 + (\beta_1 \times x_i)$, where $g$ is some link function (invertible and differentiable). I think this is a key (yet still somewhat confusing to me) distinction between SLR and GLM. In SLR we transform the response ($y_i$) and model that; in a GLM we transform the expected value ($\mu_i$ in the Gaussian example) and model that. Another way of saying it is that in the SLR case we model $E[g(Y_i) \mid X_i = x_i] = \beta_0 + (\beta_1 \times x_i)$, but in the GLM world we are modelling $\eta_i = g(E[Y_i \mid X_i = x_i]) = \beta_0 + (\beta_1 \times x_i)$.
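To put the contrast in symbols for the log case:

$$\text{transformed-response SLR:}\quad E[\log Y_i \mid X_i = x_i] = \beta_0 + \beta_1 x_i, \qquad \text{GLM with log link:}\quad \log E[Y_i \mid X_i = x_i] = \beta_0 + \beta_1 x_i.$$

These are not the same model, because in general $E[\log Y] \neq \log E[Y]$ (Jensen's inequality).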

My question is really a request to verify my understanding and statement of the foundations of GLMs, and of the differences between a GLM and the traditional linear model. Thanks.

Best Answer

You have several questions bundled together. My answer is partial and focuses on link functions and transformations, which I take to be more different than they seem.

I think it's important to keep the similar but not identical ideas of transformations and link functions distinct. The introductory literature I have seen does not do an especially good job on that, probably because the authors were too smart to realise that other people could get confused. A first approximation is that the link function has a loosely similar role to transformation of the response (outcome, dependent variable), but that aside the differences are crucial.

Focus on the common and relatively simple case of trying to predict $\log Y$ rather than $Y$ with some $\beta_0 + \beta_1 X$. Here the crucial detail is that the regression in no sense knows about the transformation. Rather, it's your decision that it would be a good idea to transform first (I will call this the "before" step). But the regression doesn't know what you did before. It is oblivious to where the data come from and just sees some $Y_\text{different}$. Also, the distributional assumption is still that the error term is normal. Otherwise put, in $$\log Y = Y_\text{different} = \beta_0 + \beta_1 X + \epsilon$$ the first equality is your private knowledge and the second is what defines the regression model. Thinking that the normal assumption about errors corresponds to a lognormal distribution on your original scale is also private (and such errors would be multiplicative not additive).
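To see the last point explicitly, exponentiating both sides of the model for $\log Y$ gives

$$Y = \exp(\beta_0 + \beta_1 X)\,\exp(\epsilon),$$

so on the original scale the error enters multiplicatively, and a normal $\epsilon$ corresponds to a lognormal factor $\exp(\epsilon)$.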

Similarly, with classical regression there is often an "after" step, in which for example you reverse the transformation to get predictions of the original $Y$, and perhaps even adjust the confidence intervals to correct for side-effects of transformation, at least to a good approximation. But that is nothing to do with the regression. Indeed, this step is not compulsory, and sometimes it is a good idea to stay on a logarithmic scale and think on that scale. (In effect, using units of measurement such as pH or decibels that are logarithmic is a decision of this kind, even if such a decision would be regarded as scientific rather than statistical.)
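As a concrete example of why the reverse transformation may need adjustment: if the model on the log scale holds with normal errors of variance $\sigma^2$, naively exponentiating fitted values targets the conditional median of $Y$, whereas the conditional mean is

$$E[Y \mid x] = \exp\!\left(\beta_0 + \beta_1 x + \tfrac{\sigma^2}{2}\right),$$

which is one reason the "after" step may involve an adjustment rather than simple exponentiation.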

Contrast this with generalised linear models -- in this example, using a logarithmic link -- in which "before", fitting and "after" stages are tightly linked, indeed inseparable as far as a data analyst is concerned. The link makes the transformation of the response unnecessary, but the model fitting automatically includes the equivalent of the "after" stage, thus yielding predictions on the original scale. The invertibility of the link is naturally crucial here.
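If it helps to see the two workflows side by side in code, here is a rough sketch. It assumes the Python statsmodels package and simulated data (it is not from any of the sources cited in the question); the point is only that ordinary regression on $\log Y$ models $E[\log Y]$, while a GLM with a log link models $\log E[Y]$ and returns predictions on the original scale automatically.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=200)
# Multiplicative (lognormal) errors on the original scale.
y = np.exp(0.5 + 1.2 * x + rng.normal(scale=0.4, size=x.size))

X = sm.add_constant(x)

# Transformation route: "before" step by hand, OLS, then an "after" step by hand.
ols_fit = sm.OLS(np.log(y), X).fit()
pred_ols = np.exp(ols_fit.predict(X))   # naive back-transform (estimates the conditional median)

# Log-link route: no transformation of y; predictions are already on the original scale.
glm_fit = sm.GLM(y, X, family=sm.families.Gaussian(link=sm.families.links.Log())).fit()
pred_glm = glm_fit.predict(X)

print(pred_ols[:5])
print(pred_glm[:5])
```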

All this refers only to transformations of the response. Using a generalised linear model can still mean transforming predictor variables if that is appropriate.

I've found Lane's paper to be very helpful as a fairly informal but trustworthy discussion.

Lane, P.W. 2002. Generalized linear models in soil science. European Journal of Soil Science 53: 241–251. doi: 10.1046/j.1365-2389.2002.00440.x
