Solved – Problems interpreting GAM output

basis functiongeneralized-additive-modelmgcvrsplines

I have been advised to run General Additive Models to be able to describe trends in my data, my data being animal harvest numbers by year. I have done so, but have a problem with interpreting the model output. Hopefully it will be sufficient to show you parts of my script, and then three of my graphs with corresponding model output.

The synthax below, fitting both a linear and a nonlinear component:

library(mgcv)
model1 <- gam(Tot~Year+s(Year),family=poisson,data=ds)
model2 <- gam(Tot~Year+s(Year),family=poisson,data=ls)
model3 <- gam(Tot~Year+s(Year),family=poisson,data=nw)

Then I get the summary output for the models:

summary(model1)
summary(model2)
summary(model3)

Then I group the data in 5-year periods, before generating the data for the plots:

YearP=seq(1975,2015,by=5)
model1.pred=predict(model1,newdata=data.frame(Year=YearP),type="response",se.fit=T)
model2.pred=predict(model2,newdata=data.frame(Year=YearP),type="response",se.fit=T)
model3.pred=predict(model3,newdata=data.frame(Year=YearP),type="response",se.fit=T)

I graph the three models in the similar manner, showing both the data and the model output:

library(gplots)
plotCI(x=YearP, y=model1.pred$fit,uiw=2*model1.pred$se.fit,
    col="red",lwd=3,cex=1.2,las=1, 
    xlab="", ylab="Observed and fitted numbers")
points(ds$Year,ds$Tot,pch=19,cex=0.9)
text(1975,60,label="GRAPH 1",cex=1.4,adj=0)

Here's my three graphs xlab is years and ylab is "observed and fitted numbers", the data are the black dots and the model predictions are the red bars:

enter image description here

And here are the outputs for the three models:

enter image description here

I am trying to use the best method possible to explain my data, and I believe this is it. I have tried to google this, and I have read the relevant answers at CrossValidated, but have yet to find explanations that make sense to me. I need someone to tell me what the output means.

I specifically need help to understand the estimate of the parametric coefficient and the p-value, what do they tell me? E.g. in graph 1 there is obviously an increase, and in graph 3 a decrease, but coefficient estimates for both are positive. This is difficult to understand.

Then I also need to know what the "Approximate significance of the smooth terms" tell me, especially the edf and the p-value here?

Extremely grateful for help to understand.

Best Answer

Without data or reproducible example (nor time right now to create one of my own) I'm going to speculate that the cause of the 0 estimate and NAs in the summary() output are due to model identifiability problems.

This is most likely due to your model not being the correct way (in mgcv at least) to fit a model like this. First off, we need to note that the basis expansion of Year contains a linear function. Hence the parametric Year effect plus the basis expansion of Year include two of the same terms. I thought mgcv had some way of identifying these kinds of things and handling it, but it may not be able to do something about it unless it essentially ends up messing with the parametric parts of the model (I'm speculating - I need to dig into this more to be sure).

This linear function in the basis is part of the null space of the basis. So is a flat function, which would be confounded with the model intercept, so it is removed from the initial basis expansion as we could add a scalar value to the intercept and remove that same value from the coefficient for the flat basis function and recover the same fit. Hence there are an infinity of models to choose from and that way madness lies.

What we need to do, if you want to fit a model of the form

$$\hat{y_i} = \beta_0 + \beta_1 \text{Year}_i + f(\text{Year}_i)$$

is to exclude the linear function from the basis, which we do by requesting that the basis expansion for s(Year) not have a null space. In mgcv-speak this is done as follows:

m <- gam(Tot ~ Year + s(Year, bs = 'tp', m = c(2,0)),
         family = poisson, data = ds, method = 'ML')

Here we're being explicit about wanting to use a thin plate regression spline basis (bs = 'tp') — this is the default basis but I don't think the other bases all allow the same control over the null space, so I'm being explicit.

The m argument is how we pass information to the function that generates the basis on how we want to control the generation. Here, the form c(2, 0) says that we want a penalty on the second derivative (2, which is the default), whilst the 0 indicates that we do not want any null space in the resulting basis. We have to specify the 2; we can't use the single integer form as that is interpreted as the order of the basis. Hence we specify both aspects:

c(penalty_order, null_space_size)

Now you can interpret this model as having an intercept, plus a linear parametric term for a linear trend, plus some non-linear deviation from the linear trend.

You could compare that with a model that only included the parametric terms and compare them via a generalised likelihood ratio test via anova()(although do heed the warnings in ?anova.gam about the limits of this test). Or more simply, just look at the summary() output and the test for the smooth function is a test for the non-linearity. If the non-linear effect is small relative to the parametric trend, then it would suggest that the extra non-linearity is not needed. This is encoded in the test shown in the output from summary() for the smooth term.

Note: you might consider centring the Year variable for model fitting so that Year = 0 corresponds to mean(Year), which would put the intercept in the middle(-ish) of the time series, rather than extrapolating all the way out to Year 0

Examples

First a Gaussian example, for a slightly non-linear true function F

set.seed(1)
F <- function(x) exp(2 * x)
scale <- 2
x0 <- runif(100)
f <- F(x0)
y <- f + rnorm(100, 0, scale)

m1 <- gam(y ~ s(x0), family = gaussian, method = 'REML')
m2 <- gam(y ~ x0 + s(x0), family = gaussian, method = 'REML')
m3 <- gam(y ~ x0 + s(x0, m = c(2,0)), family = gaussian, method = 'REML')

Here m1 does what we want but without the decomposition into linear and non-linear terms:

> summary(m1)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x0)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.1928     0.1897   16.83   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
        edf Ref.df    F  p-value    
s(x0) 2.044  2.563 34.4 1.39e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =   0.47   Deviance explained = 48.1%
-REML = 207.16  Scale est. = 3.5973    n = 100

m2 doesn't work in the sense that we can't uniquely identify the parametric linear term, which gets aliased and as reported in the summary output, the model matrix is rank deficient

> summary(m2)

Family: gaussian 
Link function: identity 

Formula:
y ~ x0 + s(x0)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.1928     0.1897   16.83   <2e-16 ***
x0            0.0000     0.0000      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
        edf Ref.df     F  p-value    
s(x0) 2.031  2.547 34.62 1.37e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Rank: 10/11
R-sq.(adj) =   0.47   Deviance explained = 48.1%
-REML =  205.6  Scale est. = 3.5979    n = 100

m3 works as we get effectively the same fitted values as m1

> all.equal(fitted(m1), fitted(m3))
[1] "Mean relative difference: 6.15775e-07"

and we can identify a parametric linear term plus some non-linearity

> summary(m3)

Family: gaussian 
Link function: identity 

Formula:
y ~ x0 + s(x0, m = c(2, 0))

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -0.5069     1.1486  -0.441  0.65996   
x0            7.1443     2.1875   3.266  0.00151 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
        edf Ref.df     F p-value  
s(x0) 1.044      8 0.304  0.0963 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =   0.47   Deviance explained = 48.1%
-REML = 205.84  Scale est. = 3.5973    n = 100

Here's the same idea but for Poisson data

set.seed(1)
scale <- 0.5
ff <- F(x0)
g <- exp(ff * scale)
f <- log(g)
y <- rpois(100, g)

m4 <- gam(y ~ s(x0), family = poisson, method = 'REML')
m5 <- gam(y ~ x0 + s(x0), family = poisson, method = 'REML')
m6 <- gam(y ~ x0 + s(x0, m = c(2,0)), family = poisson, method = 'REML')

Wherein we see the same rank-deficient model matrix in the case of m5, but otherwise the fits are the same

> summary(m4)

Family: poisson 
Link function: log 

Formula:
y ~ s(x0)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.63304    0.05125   31.86   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
        edf Ref.df Chi.sq p-value    
s(x0) 2.918   3.64  476.1  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.885   Deviance explained = 86.8%
-REML = 218.43  Scale est. = 1         n = 100
> summary(m5)

Family: poisson 
Link function: log 

Formula:
y ~ x0 + s(x0)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.63304    0.05125   31.86   <2e-16 ***
x0           0.00000    0.00000      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
        edf Ref.df Chi.sq p-value    
s(x0) 2.918   3.64  476.1  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Rank: 10/11
R-sq.(adj) =  0.885   Deviance explained = 86.8%
-REML = 217.51  Scale est. = 1         n = 100
> summary(m6)

Family: poisson 
Link function: log 

Formula:
y ~ x0 + s(x0, m = c(2, 0))

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.1134     0.4377  -0.259    0.796    
x0            3.3724     0.8254   4.086 4.39e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
        edf Ref.df Chi.sq  p-value    
s(x0) 1.918      8  16.82 2.21e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.885   Deviance explained = 86.8%
-REML =  217.1  Scale est. = 1         n = 100
Related Question