Solved – Non normal residuals for Tweedie GLM

generalized linear modelnonparametricresidualstweedie-distribution

I am using Tweedie GLM as my data contains exact zeroes. However, my stats is weak and want to confirm a few things.

Does Tweedie GLM assume normality of residuals?
Is shapiro.test() the way for finding normality of residuals for a model with Tweedie GLM?
If the data was not normally distributed nor were the residuals can I use "glht" function for post hoc analyses?

Here is the histogram of the response variable. I tried transforming the response variable but was not able to do so. Hence I have used the values as they were obtained from the dataset.

here is the code:

require(statmod)

require(tweedie)

c0 <- tweedie.profile(y~x, data = c, p.vec = seq(1.0, 2.0, 0.01), method = "series")

c0$p.max

c1 <- glm(y~x, data = c, family = tweedie(var.power = 1.11, link.power = 0))

summary(c1)

shapiro.test(residuals(c1))



    Shapiro-Wilk normality test

data:  residuals(c1)
W = 0.81176, p-value < 2.2e-16

Residuals are not normal.

Is the code correct?
Is Tweedie GLM one of the options for a dataset as mine?

Any suggestions welcome. Thanks.

Best Answer

No, a Tweedie GLM assumes that the responses follow a Tweedie distribution so, obviously, neither the data nor the ordinary residuals are expected to follow a normal distribution.
No, a Shapiro test is not at all appropriate. The only practical way to examine residuals from a GLM such as this is to plot the quantile residuals. Unlike other types of residuals, the quantile residuals are normally distributed, even when y follows a mixed discrete-continuous distribution as in this case. For example, make a probability plot of the residuals:
```
res <- qresiduals(c1)
qnorm(res)
```
The plot of residuals vs the covariate would also useful:
```
plot(x, res)
```
Note that these plots are examining whether your fitted model is appropriate as much as they are examining the distribution of y. If the second plot shows a pattern, then that would suggest you might need more or different predictors on your model.
glht claims to work for any GLM, so presumably it will run on a Tweedie GLM. But there seems no reason why you need the glht function. It is easy to test the significance of your model using standard GLM functions in R:
```
summary(c1)
anova(c1, test="F")
```
Why make the analysis more complicated than necessary?
You code looks ok in principle, but obviously we can't vouch for whether your analysis is completely correct from the limited information you've given.
Yes, definitely. From the limited information you've given, this seems the sort of data that Tweedie GLMs are intended for. I might change my mind if you explained the physical meaning of your data, for example what your response variable actually is and what leads to exact zeros but, from what you've said so far, the Tweedie model seems appropriate.

By the way, I assume that you have set var.power=1.11 because that was the estimate from c0$p.max.

Related Solutions

Solved – I log transformed the dependent variable, can I use GLM normal distribution with LOG link function

Can I use GLM normal distribution with LOG link function on a DV that has already been log transformed?

Yes; if the assumptions are satisfied on that scale

Is the variance homogeneity test sufficient to justify using normal distribution?

Why would equality of variance imply normality?

Is the residual checking procedure correct to justify choosing the link function model?

You should beware of using both histograms and goodness of fit tests to check the suitability of your assumptions:

1) Beware using the histogram for assessing normality. (Also see here)

In short, depending on something as simple as a small change in your choice of binwidth, or even just the location of the bin boundary, it's possible to get quite different impresssions of the shape of the data:

Two histograms of residuals

That's two histograms of the same data set. Using several different binwidths can be useful in seeing whether the impression is sensitive to that.

2) Beware using goodness of fit tests for concluding that the assumption of normality is reasonable. Formal hypothesis tests don't really answer the right question.

e.g. see the links under item 2. here

About the variance, that was mentioned in some papers using similar datasets "because distributions had homogeneous variances a GLM with a Gaussian distribution was used". If this is not correct, how can I justify or decide the distribution?

In normal circumstances, the question isn't 'are my errors (or conditional distributions) normal?' - they won't be, we don't even need to check. A more relevant question is 'how badly does the degree of non-normality that's present impact my inferences?"

I suggest a kernel density estimate or normal QQplot (plot of residuals vs normal scores). If the distribution looks reasonably normal, you have little to worry about. In fact, even when it's clearly non-normal it still may not matter very much, depending on what you want to do (normal prediction intervals really will rely on normality, for example, but many other things will tend to work at large sample sizes)

Funnily enough, at large samples, normality becomes generally less and less crucial (apart from PIs as mentioned above), but your ability to reject normality becomes greater and greater.

Edit: the point about equality of variance is that really can impact your inferences, even at large sample sizes. But you probably shouldn't assess that by hypothesis tests either. Getting the variance assumption wrong is an issue whatever your assumed distribution.

I read that scaled deviance should be around N-p for the model for a good fit right?

When you fit a normal model it has a scale parameter, in which case your scaled deviance will be about N-p even if your distribution isn't normal.

in your opinion the normal distribution with log link is a good choice

In the continued absence of knowing what you're measuring or what you're using the inference for, I still can't judge whether to suggest another distribution for the GLM, nor how important normality might be to your inferences.

However, if your other assumptions are also reasonable (linearity and equality of variance should at least be checked and potential sources of dependence considered), then in most circumstances I'd be very comfortable doing things like using CIs and performing tests on coefficients or contrasts - there's only a very slight impression of skewness in those residuals, which, even if it's a real effect, should have no substantive impact on those kinds of inference.

In short, you should be fine.

(While another distribution and link function might do a little better in terms of fit, only in restricted circumstances would they be likely to also make more sense.)

Solved – Checking residuals for normality in generalised linear models

NB the deviance (or Pearson) residuals are not expected to have a normal distribution except for a Gaussian model. For the logistic regression case, as @Stat says, deviance residuals for the $i$th observation $y_i$ are given by

$$r^{\mathrm{D}}_i=-\sqrt{2\left|\log{(1-\hat{\pi}_i)}\right|}$$

if $y_i=0$ &

$$r^{\mathrm{D}}_i=\sqrt{2\left|\log{(\hat{\pi}_i)}\right|}$$

if $y_i=1$, where $\hat{\pi_i}$ is the fitted Bernoulli probability. As each can take only one of two values, it's clear their distribution cannot be normal, even for a correctly specified model:

#generate Bernoulli probabilities from true model
x <-rnorm(100)
p<-exp(x)/(1+exp(x))

#one replication per predictor value
n <- rep(1,100)
#simulate response
y <- rbinom(100,n,p)
#fit model
glm(cbind(y,n-y)~x,family="binomial") -> mod
#make quantile-quantile plot of residuals
qqnorm(residuals(mod, type="deviance"))
abline(a=0,b=1)

Q-Q plot n=1

But if there are $n_i$ replicate observations for the $i$th predictor pattern, & the deviance residual is defined so as to gather these up

$$r^{\mathrm{D}}_i=\operatorname{sgn}({y_i-n_i\hat{\pi}_i})\sqrt{2\left[y_i\log{\frac{y_i}{n\hat{\pi}_i}} + (n_i-y_i)\log{\frac{n_i-y_i}{n_i(1-\hat{\pi}_i)}}\right]}$$

(where $y_i$ is now the count of successes from 0 to $n_i$) then as $n_i$ gets larger the distribution of the residuals approximates more to normality:

#many replications per predictor value
n <- rep(30,100)
#simulate response
y<-rbinom(100,n,p)
#fit model
glm(cbind(y,n-y)~x,family="binomial")->mod
#make quantile-quantile plot of residuals
qqnorm(residuals(mod, type="deviance"))
abline(a=0,b=1)

Q-Q plot n=30

Things are similar for Poisson or negative binomial GLMs: for low predicted counts the distribution of residuals is discrete & skewed, but tends to normality for larger counts under a correctly specified model.

It's not usual, at least not in my neck of the woods, to conduct a formal test of residual normality; if normality testing is essentially useless when your model assumes exact normality, then a fortiori it's useless when it doesn't. Nevertheless, for unsaturated models, graphical residual diagnostics are useful for assessing the presence & the nature of lack of fit, taking normality with a pinch or a fistful of salt depending on the number of replicates per predictor pattern.

Best Answer

Related Solutions

Solved – I log transformed the dependent variable, can I use GLM normal distribution with LOG link function

Solved – Checking residuals for normality in generalised linear models

Related Question