Solved – Why do qq-plots appear to show normal residuals from a GAM when the underlying distribution is not normal

generalized-additive-model, normality-assumption, qq-plot, r, residuals

Say you do this in R:

g <- rgamma(5000, 4)
t <- rt(5000, 5)

So, now you've got data from the gamma and $t$-distributions.

Then you model these. I use 'gam' from 'mgcv'.

library(mgcv)

m1 <- gam(g ~ 1, family = Gamma)
m2 <- gam(g ~ 1, family = scat)

m3 <- gam(20 + t ~1, family=scat)   # add 20 to get positive values
m4 <- gam(20 + t ~1, family=Gamma)  # ^^^

Then you plot the four corresponding normal qq-plots of the residuals. (I used resid() to get these.) The models above have nothing to do with the normal distribution, so why can I easily tell which model is the correct one just by looking at the normal qq-plots of the residuals?

In the resulting plots, the qq-plots for models 1 and 3 (the correct models) look normal. The others do not.
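A minimal sketch of how the four plots could be produced, assuming the mgcv package is installed. The seed, the variable names `tv` and `y_t`, and the panel titles are illustrative choices that are not part of the original code.

```r
library(mgcv)  # provides gam() and the scat family

set.seed(1)              # assumption: seed added only for reproducibility
g   <- rgamma(5000, 4)   # gamma sample with shape 4
tv  <- rt(5000, 5)       # t sample with 5 df (named tv to avoid masking base::t)
y_t <- 20 + tv           # shifted so the Gamma family sees positive values

fits <- list(
  "m1: gamma data, Gamma" = gam(g ~ 1,   family = Gamma),
  "m2: gamma data, scat"  = gam(g ~ 1,   family = scat),
  "m3: t data, scat"      = gam(y_t ~ 1, family = scat),
  "m4: t data, Gamma"     = gam(y_t ~ 1, family = Gamma)
)

# One normal qq-plot per model, using the default residuals from resid()
op <- par(mfrow = c(2, 2))
for (nm in names(fits)) {
  r <- resid(fits[[nm]])
  qqnorm(r, main = nm)
  qqline(r)
}
par(op)
```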

Best Answer

The function you called returns deviance residuals by default. (resid is an alias of residuals, which dispatches to residuals.gam when called on a gam object; see its help page.)

These are typically considerably more normal-looking than raw residuals ($y_i-\hat{\mu}_i$).
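The difference between the two kinds of residual can be checked directly. The sketch below assumes mgcv is installed and uses the `type` argument of `residuals()`; the seed and the helper `skew` are illustrative and not from the original post.

```r
library(mgcv)

set.seed(1)                       # assumption: seed added for reproducibility
g  <- rgamma(5000, 4)
m1 <- gam(g ~ 1, family = Gamma)

r_dev <- residuals(m1, type = "deviance")  # what resid(m1) is expected to return
r_raw <- residuals(m1, type = "response")  # raw residuals y_i - mu_hat_i

# Does resid() with no type argument match the deviance residuals?
same_as_default <- isTRUE(all.equal(resid(m1), r_dev))
print(same_as_default)

# Hypothetical helper: sample skewness, as a crude measure of asymmetry
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
print(c(deviance = skew(r_dev), raw = skew(r_raw)))
```

A skewness much closer to zero for the deviance residuals than for the raw ones would be consistent with the statement above.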

For a gamma random variable, the deviance residuals would be

$r_D(i)=\operatorname{sign}(y_i-\hat{\mu}_i)\sqrt{-2\nu [\log(\frac{y_i}{\hat{\mu}_i})-\frac{y_i-\hat{\mu}_i}{\hat{\mu}_i}]}$

(though presumably it would be estimating $\nu$ from the deviance)

In particular, for your model, $\hat{\mu}_i$ will be $\bar y$, and since the sample size is very large we might reasonably approximate it by $\mu$.
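As a rough check of the formula, the deviance residuals can be computed by hand and compared with what `resid()` returns. This sketch assumes that mgcv reports the unscaled version (the formula above with the factor $\nu$ dropped); since $\nu$ is a constant, that only rescales the residuals and would not change the shape of a qq-plot. The seed and the names `mu_hat` and `r_hand` are illustrative.

```r
library(mgcv)

set.seed(1)                       # assumption: seed added for reproducibility
g  <- rgamma(5000, 4)
m1 <- gam(g ~ 1, family = Gamma)

mu_hat <- as.numeric(fitted(m1))  # constant fitted mean for an intercept-only model

# Unscaled gamma deviance residuals; pmax() guards against tiny negative
# values caused by rounding when y is very close to mu_hat
d      <- 2 * ((g - mu_hat) / mu_hat - log(g / mu_hat))
r_hand <- sign(g - mu_hat) * sqrt(pmax(d, 0))

# Compare with the residuals reported by mgcv
print(all.equal(as.numeric(resid(m1)), r_hand))

# The fitted mean should be (numerically) the sample mean
print(c(fitted = mu_hat[1], ybar = mean(g)))
```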

If you look at the function $t(x)=\operatorname{sign}(x-1)\sqrt{x-\log(x)-1}$ in the vicinity of $1$ (NB $r_D(i) \propto t(y_i/\hat{\mu}_i)$), it's rather similar to (a linear transformation of) a cube root.

The cube root is an approximate symmetrizing transformation for the gamma (sometimes called the Wilson-Hilferty transformation); note that Anscombe residuals for the gamma are $3(\sqrt[3]{x} - 1)$ applied to $y/\hat\mu$. Both transformations ($t$ and the cube root) would be expected to produce close-to-normal results for gamma variates.
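The similarity between $t$ and the cube root can be examined numerically. In the sketch below, the scaling by $1/\sqrt{2}$ comes from matching the second-order Taylor expansions of $t(x)$ and $3(\sqrt[3]{x}-1)$ around $x=1$; the grid range and the function names `t_fun` and `cube` are illustrative choices.

```r
# t(x) from the deviance residual; pmax() guards against rounding near x = 1
t_fun <- function(x) sign(x - 1) * sqrt(pmax(x - log(x) - 1, 0))

# Anscombe-type cube-root transformation
cube <- function(x) 3 * (x^(1/3) - 1)

x <- seq(0.3, 3, length.out = 271)  # assumption: a grid around 1

# Largest gap between t(x) and the rescaled cube root on this grid
max_gap <- max(abs(t_fun(x) - cube(x) / sqrt(2)))
print(max_gap)

# Overlay the two curves
plot(x, t_fun(x), type = "l", xlab = "x", ylab = "transformed value")
lines(x, cube(x) / sqrt(2), lty = 2)
legend("topleft", legend = c("t(x)", "cube root, rescaled"), lty = c(1, 2))
```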

(In implementations, $r_D(i)$ may also be adjusted for the observation's influence on its own fitted value by dividing by $\sqrt{1-h_{ii}}$; however, those factors are constant for your examples.)

In the case of the scaled-t (which is not exponential family), it's not immediately clear from the residuals.gam function what residuals are being used in that case, but it would not be surprising if they were similarly a kind that would be more normal-looking than raw residuals.
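One way to probe this empirically is to compare the tails of the default residuals with those of the raw residuals for the scaled-t fit. This is only a sketch: it assumes mgcv is installed and that `residuals()` accepts `type = "response"` for this family, and the seed and the helper `kurt` are illustrative.

```r
library(mgcv)

set.seed(1)                      # assumption: seed added for reproducibility
y  <- 20 + rt(5000, 5)
m3 <- gam(y ~ 1, family = scat)

r_default <- resid(m3)                         # whatever residuals.gam returns by default
r_raw     <- residuals(m3, type = "response")  # raw residuals y_i - mu_hat_i

# Hypothetical helper: sample kurtosis (3 for a normal distribution)
kurt <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2
print(c(default = kurt(r_default), raw = kurt(r_raw)))
```

A kurtosis much nearer 3 for the default residuals than for the raw ones would support the guess that they are of a more normal-looking kind.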

Related Solutions

Solved – Which distribution should I use when building a GAM in R

What type of distribution would you consider this data to be?

It's largely irrelevant what that looks like: what you have here is the marginal distribution, but the GAM will be modelling the conditional distributions (conditional on the predictors, that is).

My first thought is an F-distribution, but I suppose it could also be a Poisson distribution.

If it could be one, it could not possibly be the other, since one is continuous and the other is discrete. But again, it doesn't matter for this problem.

The F-distribution is not a family type. Does anyone know why not?

Because the F is not exponential-family.

If there's a strong need for an F, there are other things you might try (but I see no reason to think that there is any need for an F at all, other than a very superficial similarity of the - irrelevant - shape of the marginal distribution).

Should I go with the Poisson distribution?

I don't see why this would be a reasonable choice. Times aren't discrete. Even if that weren't an issue, I see no reason to expect that the variance will be equal to the mean. Indeed, since you'd expect the spread-mean relationship not to depend on your time unit, I'd plump for variance proportional to mean squared (which suggests Gamma or lognormal as two easy options).
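The unit argument can be spelled out with a short calculation. Suppose a time $T$ has mean $\mu$ and variance $\phi\mu^2$ for some constant $\phi$ (the symbols $T$, $c$ and $\phi$ are introduced here only for illustration). Changing the time unit multiplies $T$ by a constant $c>0$, and then

$$E[cT]=c\mu, \qquad \operatorname{Var}(cT)=c^2\phi\mu^2=\phi\,(c\mu)^2,$$

so the same variance-mean relationship holds in the new units. If instead $\operatorname{Var}(T)=\mu$, as for the Poisson, then

$$\operatorname{Var}(cT)=c^2\mu=c\,(c\mu),$$

which equals the new mean $c\mu$ only when $c=1$.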

My first choice for times would typically be Gamma, with either a log link or an inverse link (or rarely, perhaps identity), depending on my understanding of the circumstances. In some circumstances I might transform times (to speeds or log-times) and try to model those.
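A minimal sketch of a Gamma GAM with a log link, assuming mgcv is installed. The data are simulated and entirely hypothetical: the predictor `x`, the mean function, and the shape parameter are made up for illustration.

```r
library(mgcv)

set.seed(1)                                 # assumption: seed added for reproducibility
n    <- 500
x    <- runif(n)                            # hypothetical predictor
mu   <- exp(1 + sin(2 * pi * x))            # hypothetical conditional mean time
time <- rgamma(n, shape = 5, rate = 5 / mu) # Gamma times: mean mu, variance mu^2 / 5

# Gamma response with a log link; s(x) is a smooth term in the predictor
fit <- gam(time ~ s(x), family = Gamma(link = "log"), method = "REML")
print(summary(fit))

# Fitted means on the response (time) scale
mu_hat <- fitted(fit)
```

Swapping `link = "log"` for `link = "inverse"` or `link = "identity"` gives the other link choices mentioned above.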

There are other possibilities, such as inverse-Gaussian or Tweedie**, for example.

**(not among the families that come with base R's glm; the statmod package supplies a tweedie family for GLMs, and mgcv provides Tweedie and tw families for GAMs)

Solved – Calculating a risk ratio for specific x values from a GAM model using the mgcv package

This doesn't exactly answer your question, but it might still solve your problem of needing to calculate risk ratios. The epiR package allows you to calculate risk ratios.

I could not get your example to work (see my comment to your question), so here is an example from the package's documentation:

library(epiR) # Used for Risk ratio
library(MASS) # Used for data

dat1 <- birthwt; head(dat1)

## Generate a table of cell frequencies. First set the levels of the outcome
## and the exposure so the frequencies in the 2 by 2 table come out in the
## conventional format:
dat1$low <- factor(dat1$low, levels = c(1,0))
dat1$smoke <- factor(dat1$smoke, levels = c(1,0))
dat1$race <- factor(dat1$race, levels = c(1,2,3))
## Generate the 2 by 2 table. Exposure (rows) = smoke. Outcome (columns) = low.
tab1 <- table(dat1$smoke, dat1$low, dnn = c("Smoke", "Low BW"))
print(tab1)
## Compute the incidence risk ratio and other measures of association:
epi.2by2(dat = tab1, method = "cohort.count",
         conf.level = 0.95, units = 100, outcome = "as.columns")
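For reference, the incidence risk ratio itself can be computed by hand from the same 2 by 2 table, which shows what the quantity is. This sketch uses only base R and the MASS data; the names `tab`, `rr`, `se_log_rr` and `ci` are illustrative, and the interval is a standard Wald interval on the log scale, which is not necessarily the method epi.2by2 uses.

```r
library(MASS)  # for the birthwt data

# Rows: exposure (smoke), columns: outcome (low birth weight); level 1 first
tab <- table(Smoke = factor(birthwt$smoke, levels = c(1, 0)),
             LowBW = factor(birthwt$low,   levels = c(1, 0)))

a  <- tab["1", "1"]   # exposed with the outcome
n1 <- sum(tab["1", ]) # all exposed
c0 <- tab["0", "1"]   # unexposed with the outcome
n0 <- sum(tab["0", ]) # all unexposed

# Risk ratio: risk among the exposed divided by risk among the unexposed
rr <- (a / n1) / (c0 / n0)

# Wald 95% confidence interval on the log scale
se_log_rr <- sqrt(1 / a - 1 / n1 + 1 / c0 - 1 / n0)
ci <- exp(log(rr) + c(-1, 1) * qnorm(0.975) * se_log_rr)

print(c(RR = rr, lower = ci[1], upper = ci[2]))
```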
