Generalized Additive Model – Implications of Keeping a Low Basis Dimension in GAMM

Tags: basis-function, generalized-additive-model, mgcv, overfitting

Some of the smooths in my generalized additive mixed model (GAMM) indicate, via mgcv::k.check(m), that they want to be more wiggly, but I don't think I have enough data to capture the full complexity of the trend. I know we're supposed to increase the basis dimension if k.check indicates it's too small, but I think a smoother pattern is fine since my main goal is prediction. Can I ignore the low p-values and keep the basis dimension, k, at the default? Also, is there an issue with "using up" all the available EDF by maxing out k (i.e. does setting k = 15, the maximum here, come with any issues)?

I'm afraid of over-fitting the model. It seems from this post that leaving k at the default is fine in my case, given my data-size constraint, no? Since the EDF and basis dimension are still fairly close even when k is set to the maximum, is this an indication that I have insufficient data for the interaction and shouldn't bother increasing k? Should I instead model the terms separately, as CYR.std + fSeason + ...? Only the full dataset reproduces the effect and it's too large to show here.

# num = abundance, sal = salinity (psu),
# fSite = factor site (N=47),
# fSeason = factor season,
# CYR.std = "0" is first year, "1" is 2nd, etc..

> with(shrimp2, nlevels(interaction(CYR.std, fSite, drop = TRUE)))
[1] 705

> gratia::model_edf(m)
# A tibble: 1 × 2
  model   edf
  <chr> <dbl>
1 m      64.1

m <- gam(num ~ s(sal) + 
           s(water_depth) +
           
           fSeason +
           s(CYR.std, by=fSeason) +
           # All sites change in the same way over (smooth) time within a season, but
           # the annual trend differs between seasons.
           # One smooth (i.e. one amount of smoothing) per season
           
# Structural components I landed on...

           s(fSite, bs = "re") + 
           # Each site has a different (average) abundance
           # Within fSite variation
           # Captures repeated measures effect
           
           s(fSite, by = fSeason, bs="re") + 
           # Sites in different seasons have different abundance
           # Captures between fSite and fSeason variance
           # I expect sites to have a different variance in the wet season than in the dry season

           # https://stats.stackexchange.com/questions/331692/random-effect-in-gam-mgcv-package
           
           offset(log(area_sampled)), 
         method = "REML",
         select = TRUE,
         family = nb(link = "log"),
         data = shrimp2)

> k.check(m)
                      k'        edf   k-index p-value
s(sal)                 9  1.7952789 0.8448504  0.1125
s(water_depth)         9  0.6680327 0.8348178  0.0625
s(CYR.std):fSeasonDRY  9  5.8789026 0.7878817  0.0000
s(CYR.std):fSeasonWET  9  8.1699374 0.7878817  0.0000
s(fSite)              47 19.6991443        NA      NA
s(fSite):fSeasonDRY   47 14.3528138        NA      NA
s(fSite):fSeasonWET   47 11.4865389        NA      NA

When k = 16:

Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : 
  A term has fewer unique covariate combinations than specified maximum degrees of freedom

Model summary and diagnostics:

> summary(m)

Family: Negative Binomial(1.215) 
Link function: log 

Formula:
num ~ s(sal) + s(water_depth) + fSeason + s(CYR.std, by = fSeason) + 
    s(fSite, bs = "re") + s(fSite, by = fSeason, bs = "re") + 
    offset(log(area_sampled))

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.74130    0.10465 -16.640   <2e-16 ***
fSeasonWET  -0.03695    0.12892  -0.287    0.774    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                         edf Ref.df  Chi.sq p-value    
s(sal)                 1.795      9  15.332 0.00449 ** 
s(water_depth)         0.668      9   2.896 0.08532 .  
s(CYR.std):fSeasonDRY  5.879      9 114.048 < 2e-16 ***
s(CYR.std):fSeasonWET  8.170      9  78.652 < 2e-16 ***
s(fSite)              19.699     46  52.488 0.00661 ** 
s(fSite):fSeasonDRY   14.353     46  29.145 0.03465 *  
s(fSite):fSeasonWET   11.486     46  20.312 0.07910 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.208   Deviance explained = 28.6%
-REML = 1439.7  Scale est. = 1         n = 1328

> mgcv::gam.vcomp(m)

Standard deviations and 0.95 confidence intervals:

                            std.dev        lower        upper
s(sal)1                9.671446e-03 2.935140e-03 3.186794e-02
s(sal)2                2.616725e-03 3.817691e-41 1.793558e+35
s(water_depth)1        2.315768e-05 4.064052e-33 1.319565e+23
s(water_depth)2        8.536465e-02 1.056212e-02 6.899303e-01
s(CYR.std):fSeasonDRY1 3.964975e-01 1.990276e-01 7.898918e-01
s(CYR.std):fSeasonDRY2 1.877162e+00 3.453046e-01 1.020473e+01
s(CYR.std):fSeasonWET1 1.310183e+00 7.074553e-01 2.426415e+00
s(CYR.std):fSeasonWET2 4.328930e+00 9.640227e-01 1.943900e+01
s(fSite)               3.329785e-01 1.854022e-01 5.980226e-01
s(fSite):fSeasonDRY    3.479135e-01 1.673902e-01 7.231237e-01
s(fSite):fSeasonWET    3.005410e-01 1.161431e-01 7.777036e-01

Rank: 11/11

# DHARMa simulated-residual diagnostics
E3NB.sqr <- simulateResiduals(fittedModel = m, plot = TRUE)

[DHARMa simulated-residuals diagnostic plot]

draw(m, unconditional = TRUE, parametric = TRUE)

[gratia::draw() plot of the estimated smooths and parametric terms for m]

m2 <- gam(num ~ s(sal) + 
           s(water_depth) +
           
           fSeason +
           s(CYR.std, by=fSeason, k=15) + ...

> k.check(m2)
                      k'        edf   k-index p-value
s(sal)                 9  1.8027635 0.8480571  0.1325
s(water_depth)         9  0.2832327 0.8387580  0.0850
s(CYR.std):fSeasonDRY 14  6.4565782 0.8012377  0.0000
s(CYR.std):fSeasonWET 14 11.1928764 0.8012377  0.0000
s(fSite)              47 19.7453617        NA      NA
s(fSite):fSeasonDRY   47 14.3656911        NA      NA
s(fSite):fSeasonWET   47 11.9452456        NA      NA

draw(m2, unconditional = TRUE, parametric = TRUE)

[gratia::draw() plot of the estimated smooths and parametric terms for m2]

Best Answer

Firstly, I think you could simplify your model a bit as I don't see the point of both of these terms:

  1. s(fSite, bs = "re"), and
  2. s(fSite, by = fSeason, bs = "re")

Both will estimate a separate mean response for each site, with the former penalizing all sites the same regardless of season, while the latter does the same thing but allows for different levels of penalization between the wet and dry seasons. If you think sites have different random abundances between seasons, you could just use s(fSite:fSeason, bs = "re"), which will give you a random intercept for each site-by-season combination.
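For concreteness, a minimal sketch of that simplification (fSiteSeason is a hypothetical factor created here just for illustration; everything else follows your original call):

# one level per observed site-by-season combination
shrimp2$fSiteSeason <- interaction(shrimp2$fSite, shrimp2$fSeason, drop = TRUE)

m_simpler <- gam(num ~ s(sal) +
                   s(water_depth) +
                   fSeason +
                   s(CYR.std, by = fSeason) +
                   s(fSiteSeason, bs = "re") +   # one random intercept per site-season
                   offset(log(area_sampled)),
                 method = "REML",
                 select = TRUE,
                 family = nb(link = "log"),
                 data = shrimp2)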

As to the thrust of your main question...

The test for basis size is a heuristic guide, not some infallible hard-and-fast rule that you must obey. It works by ordering the residuals by the covariate of the focal smooth, computing a metric that measures how similar adjacent residuals are, and then computing the same metric on residuals whose order has been permuted.
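To see the idea, here is a rough illustration (not mgcv's exact statistic, just the flavour of the permutation comparison):

# mean squared difference of neighbouring residuals, ordered by the covariate,
# compared with the same quantity for randomly re-ordered residuals
r_dev <- residuals(m, type = "deviance")
x     <- shrimp2$CYR.std
obs   <- mean(diff(r_dev[order(x)])^2)
perm  <- replicate(1000, mean(diff(sample(r_dev))^2))
mean(perm <= obs)  # small value: neighbours are "too similar", i.e. residual pattern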

When you have data that are ordered in time, and especially when you are smoothing by time, as you are here, that heuristic test is likely to be confused by autocorrelation. Indeed, wiggly smooths and strong autocorrelation are often indistinguishable mathematically.

So the pattern flagged in your k.check() output could just as plausibly be the result of unmodelled autocorrelation.

I know we're supposed to increase the basis dimension if k.check indicates it's too small...

You don't have to; the test is just a guide which can indicate if there is potentially unmodelled wiggliness.

...but I think a smoother pattern is fine since my main goal is prediction.

This may work, but you are going to induce bias into your estimates if that unmodelled wiggliness is signal you are interested in rather than noise you aren't. And if you are doing any inference on the estimated smooths, using credible intervals, etc., these are going to be biased and anti-conservative if your residuals are not conditionally i.i.d.

Can I ignore the low p-values and keep the basis dimension, k, at the default?

You can decide not to increase k, but you can't ignore the potential for a problem: you need to understand why the test rejected the null hypothesis. As I mentioned above, this could be due to unmodelled autocorrelation, but it could be due to too small a basis for the temporal components. And those two things could be the same thing.

Also, is there an issue with "using up" all the available EDF to max-out k (i.e. does setting k=15, the max, come with any issues)?

Not really; in general the main downside of fitting with a larger k is the increased compute time. If k is too large you can end up estimating implausibly wiggly effects of some covariates, so k can act as a prior on the upper limit of expected wiggliness, but that is usually an issue with covariates other than time. I don't think there is an a priori expectation that smooths of time should not be wiggly.

A consideration with time however is how you decompose that temporal effect. You could put it all into a very wiggly smooth (if you have the data), or you could fit a very smooth trend with the short-term dependence modelled using a correlation process in the covariance matrix.

It might also be that your trend shouldn't be smooth; the trend might be better considered stochastic, or smooth only in the sense of a random walk (via an MRF smooth).
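If you wanted to explore the random-walk route, here is a rough, untested sketch of the neighbourhood structure an MRF smooth over year would need (fYear and nb are hypothetical objects created here purely for illustration):

# treat year as a discrete state with a chain (RW1-like) neighbourhood structure
yr_levels <- sort(unique(shrimp2$CYR.std))
shrimp2$fYear <- factor(shrimp2$CYR.std, levels = yr_levels)
nb <- lapply(seq_along(yr_levels), function(i)
  setdiff(c(i - 1L, i + 1L), c(0L, length(yr_levels) + 1L)))  # each year neighbours the adjacent years
names(nb) <- levels(shrimp2$fYear)

# then, in the model formula, something like
#   s(fYear, bs = "mrf", xt = list(nb = nb), by = fSeason)
# in place of s(CYR.std, by = fSeason)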

I'm afraid of over-fitting the model.

I don't think you need to worry too much about this. When you are modelling time, a wiggly trend can be just fine. As I mentioned above, it really depends on how you want to model/decompose time. Do you want a simpler smooth trend? If so you can keep k low, but then you need to model the autocorrelation some other way. See below.

Some further thoughts and observations

You should also be plotting residuals against predictors to understand how well the temporal smooths capture the underlying trends in the data. Or plot the observed counts and model predictions to see the same thing. You should be looking at whether you are adequately modelling your data and also trying to understand why the smooths might be being estimated as wiggly as they are. Are those oscillations in s(CYR.std) for the wet seasons justified by the data or is the smooth chasing odd years?
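For example, a quick plotting sketch (assuming no rows were dropped by gam() for missingness, so the residuals line up with shrimp2):

# deviance residuals against the temporal covariate, coloured by season
with(shrimp2, plot(CYR.std, residuals(m, type = "deviance"),
                   col = as.integer(fSeason), pch = 16, cex = 0.5))
abline(h = 0, lty = 2)

# observed counts against fitted means
plot(shrimp2$num, fitted(m))
abline(0, 1, lty = 2)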

An alternative approach to detect unmodelled wiggliness is to fit the model

r <- resid(m)  # deviance residuals from the original model
m_2 <- gam(r ~ s(sal) + s(water_depth) + fSeason + s(CYR.std, by = fSeason) + 
    s(fSite, bs = "re") + s(fSite, by = fSeason, bs = "re"),
  method = "REML",
  family = quasi(link = "identity", variance = "constant"),
  data = shrimp2)  # r is taken from the workspace, not from shrimp2

where we are fitting to the deviance residuals instead of the original response, so we can drop the offset term. We also change the family because we expect the residuals to be distributed with mean 0 and constant variance, but we aren't assuming that they follow some specific conditional distribution, such as being conditionally Gaussian.

If you can, increase k on the smooths in that model above the k used originally. Also, adjust the model formulation depending on what you end up doing after considering my initial observation at the start of this answer.

When you look at summary(m_2) you should be checking whether any smooth uses appreciably more than 1 effective degree of freedom, i.e. whether there is unmodelled wiggliness left over. Any that do are candidates for having their k increased when you return to the model for the observed response.
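A quick way to pull those EDFs out programmatically (using the m_2 object from the model above):

summary(m_2)$s.table[, "edf"]  # EDF per smooth in the residual model
gratia::edf(m_2)               # per-smooth EDFs, if you prefer gratia (which you already use)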

But even this residual-model check may erroneously suggest your initial basis sizes are too small, again because of autocorrelation. While it is OK to model this autocorrelation with smooths of time, you might prefer a different decomposition, where you include a correlation process among the residuals to model the short-scale temporal dependence. With the response and family you are trying to fit, though, this is going to be quite difficult to do in practice.

One simple GEE-like approach, although you need to provide $\rho$ yourself, is to fit using bam() and pass an estimate of $\rho$ to the argument rho. To get this estimate you'll need to use the ACF, which assumes your data are regularly spaced in time. The trick here will be to subset the model residuals into individual time series, by site, within year, I think, given the way you are decomposing time. This will yield many ACFs, each giving an estimate of $\rho$; you'll need to supply some average of these to rho, plus tell AR.start where each time series starts/ends, as per ?bam. Note that this fits the same AR(1) process in each site/year series, not a separate AR(1) for each, so only a single $\rho$ is used.
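A rough sketch of that workflow (object names are hypothetical; here each series is taken to be a site-by-season run of years, which is one reading of the split above, so adjust the grouping to however you actually decide to cut the series):

# attach residuals from the original model, then sort so each series is
# contiguous and in time order (assumes gam() dropped no rows)
shrimp2$r_dev <- residuals(m, type = "deviance")
shrimp2 <- shrimp2[order(shrimp2$fSite, shrimp2$fSeason, shrimp2$CYR.std), ]
series <- interaction(shrimp2$fSite, shrimp2$fSeason, drop = TRUE)

# lag-1 autocorrelation per series, then an average to hand to rho
rho_hat <- tapply(shrimp2$r_dev, series, function(z)
  if (length(z) > 2) acf(z, lag.max = 1, plot = FALSE)$acf[2] else NA)
rho_use <- mean(rho_hat, na.rm = TRUE)

# TRUE at the first observation of each series, for AR.start
shrimp2$ar_start <- !duplicated(series)

m_ar <- bam(num ~ s(sal) + s(water_depth) + fSeason +
              s(CYR.std, by = fSeason) +
              s(fSite, bs = "re") + s(fSite, by = fSeason, bs = "re") +
              offset(log(area_sampled)),
            family = nb(link = "log"),
            data = shrimp2,
            method = "fREML",
            rho = rho_use, AR.start = shrimp2$ar_start)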

To judge the success of this, you'll then want to plot the standardized residuals, which are in the $std.rsd component of the fitted model (or compute an ACF of those residuals, again one per time series).
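Continuing with the hypothetical m_ar object from the sketch above:

# little remaining lag-1 correlation suggests the AR(1) has soaked up the short-scale dependence
acf(m_ar$std.rsd, lag.max = 10)
# better still, split std.rsd by series, as above, and inspect each ACF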

If you need anything more complex than this, in practice you might need to try brms::brm(), as it can fit smooths alongside more complex autocorrelation processes.
