Solved – Does the family of a GLM describe the distribution of the response variable or of the residuals?

assumptions, generalized linear model, residuals

I have been discussing this with several lab members, and we have consulted several sources, but we still don't quite have the answer:

When we say a GLM has a family of, say, Poisson, are we talking about the distribution of the residuals or of the response variable?

Points of contention

  1. This article states that the assumptions of the GLM are: the statistical independence of observations, the correct specification of the link and variance function (which makes me think of the residuals, not the response variable), the correct scale of measurement for the response variable, and the lack of undue influence of single points.

  2. This question has two answers with two points each. The one that appears first talks about the residuals, and the second one about the response variable. Which is it?

  3. In this blogpost, when talking about assumptions, they state "The distribution of the residuals can be other, eg, binomial"

  4. At the beginning of this chapter they say that the structure of the errors has to be Poisson, but the residuals will surely have both positive and negative values. How can that be Poisson?

  5. This question, which is often cited to close questions such as this one as duplicates, does not have an accepted answer.

  6. In this question, the answers talk about the response and not the residuals.

  7. In this course description from the University of Pennsylvania they talk about the response variable in the assumptions, not the residuals.

Best Answer

The family argument for glm models determines the distribution family for the conditional distribution of the response, not of the residuals (except for the quasi-models).

Look at it this way: For the usual linear regression, we can write the model as $$Y_i \sim \text{Normal}(\beta_0+x_i^T\beta, \sigma^2). $$ This means that the response $Y_i$ has a normal distribution (with constant variance), but the expectation is different for each $i$. Therefore the conditional distribution of the response is a normal distribution (but a different one for each $i$). Another way of writing this model is $$ Y_i = \beta_0+x_i^T\beta + \epsilon_i $$ where each $\epsilon_i$ is distributed $\text{Normal}(0, \sigma^2)$.

So for the normal distribution family both descriptions are correct (when interpreted correctly). This is because for the normal linear model we have a clean separation in the model between the systematic part (the $\beta_0+x_i^T\beta$) and the disturbance part (the $\epsilon_i$), which are simply added. But for other families, this separation is not possible! There is not even a single clean definition of what "residual" means (and for that reason, there are many different definitions of "residual").
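As an illustration of how many definitions are in use, here are three common residuals for a Poisson GLM. The notation $\hat\mu_i$ for the fitted conditional mean of $Y_i$ is introduced here for the example and is not part of the text above:

$$ r_i^{\text{raw}} = y_i - \hat\mu_i, \qquad r_i^{\text{Pearson}} = \frac{y_i - \hat\mu_i}{\sqrt{\hat\mu_i}}, \qquad r_i^{\text{dev}} = \operatorname{sign}(y_i - \hat\mu_i)\sqrt{2\left[y_i \log\frac{y_i}{\hat\mu_i} - (y_i - \hat\mu_i)\right]} $$

with the convention that $y_i \log(y_i/\hat\mu_i) = 0$ when $y_i = 0$. All three can be negative and are generally not integers, so none of them can follow a Poisson distribution.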

So for all those other families, we use a definition in the style of the first displayed equation above, that is, one stated in terms of the conditional distribution of the response. So, no, the residuals (however defined) in Poisson regression do not have a Poisson distribution.
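A small simulation can make the point tangible. The sketch below is only an illustration under stated assumptions: the coefficients `beta0` and `beta1`, the covariate range, and the log link are hypothetical choices, and the hand-written `sample_poisson` helper (Knuth's multiplication method) stands in for a library sampler so that the example needs nothing beyond the Python standard library. For simplicity it uses the true conditional mean in place of a fitted one.

```python
import math
import random


def sample_poisson(mu, rng):
    """Draw one Poisson(mu) variate using Knuth's multiplication method."""
    threshold = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical coefficients and covariate, chosen only for illustration
beta0, beta1 = 0.5, 0.8
n = 5000
x = [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Conditional mean under the (assumed) log link: mu_i = exp(beta0 + beta1 * x_i)
mu = [math.exp(beta0 + beta1 * xi) for xi in x]

# The response is Poisson conditionally on x_i, with a different mean for each i
y = [sample_poisson(m, rng) for m in mu]

# Raw residuals, using the true mean in place of a fitted mean
raw_resid = [yi - mi for yi, mi in zip(y, mu)]

response_is_count = all(isinstance(yi, int) and yi >= 0 for yi in y)
resid_has_negatives = any(r < 0 for r in raw_resid)
resid_all_integers = all(float(r).is_integer() for r in raw_resid)
mean_resid = sum(raw_resid) / n

print("response values are non-negative integers:", response_is_count)
print("residuals contain negative values:", resid_has_negatives)
print("residuals are all integers:", resid_all_integers)
print("mean of raw residuals:", mean_resid)
```

The response values are counts, as a Poisson variable must be, while the residuals include negative and non-integer values and are centred near zero. In a real analysis the means would come from a fitted model rather than from known coefficients, but the conclusion about the residuals is the same.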
