Solved – Family of GLM represents the distribution of the response variable or residuals

assumptions, generalized linear model, residuals

I have been discussing this with several lab members, and we have consulted several sources, but we still don't quite have the answer:

When we say a GLM has a family of, say, Poisson, are we talking about the distribution of the residuals or of the response variable?

Points of contention

This article states that the assumptions of the GLM are the statistical independence of observations, the correct specification of the link and variance functions (which makes me think of the residuals, not the response variable), the correct scale of measurement for the response variable, and the lack of undue influence of single points.
This question has two answers with two points each; the one that appears first talks about the residuals and the second one about the response variable. Which is it?
In this blogpost, when talking about assumptions, they state "The distribution of the residuals can be other, eg, binomial".
At the beginning of this chapter they say that the structure of the errors has to be Poisson, but the residuals will surely take both positive and negative values. How can that be Poisson?
This question, which is often cited to close questions like this one as duplicates, does not have an accepted answer.
In this question, the answers talk about the response and not the residuals.
In this course description from the University of Pennsylvania, they talk about the response variable in the assumptions, not the residuals.

Best Answer

The family argument for glm models determines the distribution family for the conditional distribution of the response, not of the residuals (except for the quasi-models).

Look at it this way: For the usual linear regression, we can write the model as $$Y_i \sim \text{Normal}(\beta_0+x_i^T\beta, \sigma^2). $$ This means that the response $Y_i$ has a normal distribution (with constant variance), but the expectation is different for each $i$. Therefore the conditional distribution of the response is a normal distribution (but a different one for each $i$). Another way of writing this model is $$ Y_i = \beta_0+x_i^T\beta + \epsilon_i $$ where each $\epsilon_i$ is distributed $\text{Normal}(0, \sigma^2)$.

So for the normal distribution family both descriptions are correct (when interpreted correctly). This is because for the normal linear model we have a clean separation in the model of the systematic part (the $\beta_0+x_i^T\beta$) and the disturbance part (the $\epsilon_i$) which are simply added. But for other family functions, this separation is not possible! There is not even a clean definition of what residual means (and for that reason, many different definitions of "residual").
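
To see why, take the Poisson family with a log link as an illustration (the log link is an assumption here, and the symbol $\mu_i$ is introduced only for this sketch):

$$ Y_i \sim \text{Poisson}(\mu_i), \qquad \log \mu_i = \beta_0 + x_i^T\beta. $$

If we try to force the additive form by defining $\epsilon_i = Y_i - \mu_i$, we get

$$ E(\epsilon_i) = 0, \qquad \text{Var}(\epsilon_i) = \mu_i, \qquad \epsilon_i \in \{-\mu_i,\ 1-\mu_i,\ 2-\mu_i,\ \dots\}, $$

so both the variance and the support of the "disturbance" depend on the systematic part. It is a shifted Poisson variable whose shift changes with $i$, not a draw from one common error distribution.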

So for all those other families, we use a definition in the style of the first displayed equation above. That is, the family describes the conditional distribution of the response. So, no, the residuals (however defined) in Poisson regression do not have a Poisson distribution.
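
A quick simulation can make this visible. The following is a minimal sketch in base R, not part of the original answer: the sample size, seed, and coefficients are made up for illustration. It generates a response that is Poisson given `x`, fits a Poisson GLM, and checks what the response residuals look like.

```r
# Minimal sketch: simulate a Poisson regression and inspect the residuals.
# All numbers below (seed, n, coefficients) are arbitrary illustrative choices.
set.seed(1)
n  <- 500
x  <- runif(n)
mu <- exp(0.5 + 1.2 * x)          # conditional mean under a log link
y  <- rpois(n, lambda = mu)       # response: Poisson *given* x

fit <- glm(y ~ x, family = poisson)
res <- residuals(fit, type = "response")   # observed minus fitted mean

# The response is a non-negative integer count ...
print(all(y >= 0 & y == round(y)))
# ... but the residuals take negative and non-integer values,
# so they cannot follow a Poisson distribution.
print(any(res < 0))
print(any(abs(res - round(res)) > 1e-8))
```

The same check could be run with `type = "pearson"` or `type = "deviance"`; those residuals are also not Poisson distributed.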

Related Solutions

Solved – How to get the residuals for a glm with a binary response variable using R

You can use the DHARMa package, which implements the idea of randomized quantile residuals by Dunn and Smyth (1996).

Essentially, the idea is to simulate new data from the fitted model and compare it to the observed data. For details, see https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html

Here is an example with a missing quadratic effect in the glm, which shows up in the right plot.

```
library(DHARMa)

dat = createData(replicates = 1, sampleSize = 300, intercept = -3,
           fixedEffects = 1, quadraticFixedEffects = 20, 
           randomEffectVariance = 0, family = binomial())

fit = glm(observedResponse ~ Environment1 , data = dat, family = binomial)
res = simulateResiduals(fit)
plot(res)
```
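
The simulate-and-compare idea can also be sketched in base R without any package. This is only an illustration of the principle under made-up data and settings (seed, sample size, coefficients, and `nsim` are assumptions), and it is not DHARMa's actual implementation.

```r
# Minimal sketch of simulation-based scaled residuals for a binary GLM.
# Illustrative only; data and settings are made up.
set.seed(42)
n <- 300
x <- runif(n, -1, 1)
y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x))
fit <- glm(y ~ x, family = binomial)

# Simulate new responses from the fitted model: an n x nsim matrix of 0/1 values.
nsim <- 250
sims <- as.matrix(simulate(fit, nsim = nsim))

# For each observation, locate the observed value within its simulated
# distribution. Ties are broken at random because the response is discrete.
p_less  <- rowMeans(sims < y)     # share of simulations below the observation
p_equal <- rowMeans(sims == y)    # share of simulations equal to the observation
scaled_res <- p_less + runif(n) * p_equal

print(summary(scaled_res))
```

If the model is adequate, `scaled_res` should look roughly uniform on [0, 1], so `hist(scaled_res)` should be approximately flat and `plot(x, scaled_res)` should show no trend.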

Solved – Non normal residuals for Tweedie GLM

No, a Tweedie GLM assumes that the responses follow a Tweedie distribution so, obviously, neither the data nor the ordinary residuals are expected to follow a normal distribution.
No, a Shapiro test is not at all appropriate. The only practical way to examine residuals from a GLM such as this is to plot the quantile residuals. Unlike other types of residuals, the quantile residuals are normally distributed, even when y follows a mixed discrete-continuous distribution as in this case. For example, make a probability plot of the residuals:
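```
library(statmod)        # provides qresiduals()
res <- qresiduals(c1)   # c1 is the fitted Tweedie GLM
qqnorm(res)             # normal probability (Q-Q) plot
qqline(res)             # reference line
```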
A plot of the residuals vs the covariate would also be useful:
```
plot(x, res)
```
Note that these plots are examining whether your fitted model is appropriate as much as they are examining the distribution of y. If the second plot shows a pattern, then that would suggest you might need more or different predictors in your model.
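
To show what a quantile residual is, here is a minimal base R sketch in the spirit of the randomized quantile residuals of Dunn and Smyth (1996). It uses a Poisson GLM as a stand-in, because base R has no Tweedie distribution functions; the data, seed, and coefficients are made up, and this is not the code inside `qresiduals`.

```r
# Minimal sketch of randomized quantile residuals for a discrete response.
# A Poisson GLM stands in for the Tweedie model; all numbers are illustrative.
set.seed(7)
n <- 400
x <- runif(n)
y <- rpois(n, lambda = exp(0.3 + 1.5 * x))
fit <- glm(y ~ x, family = poisson)
mu_hat <- fitted(fit)

# The fitted CDF jumps at each observed count, from P(Y < y) to P(Y <= y).
lower <- ppois(y - 1, lambda = mu_hat)   # ppois(-1, .) is 0
upper <- ppois(y, lambda = mu_hat)

# Draw a uniform value inside the jump, then map it to the normal scale.
u    <- runif(n, min = lower, max = upper)
qres <- qnorm(u)

print(c(mean = mean(qres), sd = sd(qres)))
```

When the fitted model is adequate, `qres` should behave like a standard normal sample, which is why a normal Q-Q plot of quantile residuals is a meaningful check even though the response itself is far from normal.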
glht claims to work for any GLM, so presumably it will run on a Tweedie GLM. But there seems no reason why you need the glht function. It is easy to test the significance of your model using standard GLM functions in R:
```
summary(c1)
anova(c1, test="F")
```
Why make the analysis more complicated than necessary?
Your code looks ok in principle, but obviously we can't vouch for whether your analysis is completely correct from the limited information you've given.
Yes, definitely. From the limited information you've given, this seems the sort of data that Tweedie GLMs are intended for. I might change my mind if you explained the physical meaning of your data, for example what your response variable actually is and what leads to exact zeros but, from what you've said so far, the Tweedie model seems appropriate.

By the way, I assume that you have set var.power=1.11 because that was the estimate from c0$p.max.
