Mixed Model – Using Random Wiggly Smooths for 100s of Factor Levels in Generalized Additive Models

generalized-additive-model, mgcv, mixed-model

I have a GAM of presence of a bird species from a large dataset. There are over 400 sites in my data, where each site corresponds to a unique observer. Site-level variation is the biggest source of variation in the data. There are 30+ years of data, and 20,000 rows.

My goal is to capture the overall trend in presence through time, with the site-level yearly smooths treated as random departures from that overall trend.

Here's the model I have:

    library(mgcv)

    m_occ <- bam(present ~ s(site, bs = "re") +          # random intercept per site
                   s(year, site, bs = "fs", k = 5) +     # random wiggly smooth of year per site
                   te(expertise, llcent) + fye + offset(log.n.periods) +
                   s(tmin) + ti(visit) + ti(year, k = 33) +
                   ti(visit, year),
                 data = df_train,
                 family = binomial(),
                 gamma = 1.4,
                 weights = neg_pos_weights,
                 select = TRUE,
                 discrete = TRUE,
                 control = gam.control(trace = TRUE))

Clearly there's a lot going on, but I want to focus on this part:

s(site, bs = "re") + s(year, site, bs = "fs", k = 5)

Here I have a random intercept for each site and a random wiggly smooth of year for each site. This does what I want, but it takes a long time to fit, and I need to run 50 bootstraps across 60 species across 13 study areas.

Is there a similar approach that might speed things up? A random linear trend per site, like the one below, fits efficiently (several seconds vs. 10+ minutes), but I want to add a little extra wiggle:

s(site, bs = "re") + s(site, by = year, bs = "re")

GAMs are so powerful that I'm optimistic there must be a solution! Simon Wood's GAM book covers this in section 7.7.4, but it doesn't address situations with lots of data. The factor.smooth documentation also mentions that one formulation is especially fast in gamm(), so I tried that, but it sat for 10+ minutes on the first iteration; I don't think gamm() can handle large datasets the way bam() can.
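
For concreteness, the kind of gamm() call I mean is roughly the following (a stripped-down sketch with just the global trend and the by-site smooths, not the full model I ran):

    ## Stripped-down sketch of the gamm() formulation; in gamm() the "fs" smooths are
    ## handled through nlme's random-effects machinery.
    m_gamm <- gamm(present ~ s(year) + s(year, site, bs = "fs", k = 5),
                   data = df_train,
                   family = binomial)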

Best Answer

A couple of things you can do:

  1. Turn on a form of multithreaded fitting: use the nthreads argument or a multithreaded BLAS (see the Details section of ?bam), or pass a cluster via the cluster argument (again, see the Details) to estimate the chunked model in parallel. A combined sketch of 1–3 follows this list.
  2. Drop the s(site, bs = "re") term, as random intercepts are already included in the s(year, site, bs = "fs", k = 5) term. This will save you 400+ columns in the model matrix, and these kinds of terms often drive up the computational time of a GAM considerably.
  3. Use the samfrac argument to select a subset of the 20,000 rows for initial estimation of the coefficients, before finishing off fitting with the full data.
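
Putting these together, here is a sketch of how the original bam() call might look: the s(site, bs = "re") term is dropped because the "fs" term already provides per-site intercepts (point 2), threading is turned on (point 1), and samfrac is shown as an optional extra (point 3). The nthreads value is illustrative, not a recommendation.

    ## Sketch combining suggestions 1-3; nthreads = 4 is illustrative only.
    library(mgcv)

    m_occ2 <- bam(present ~ s(year, site, bs = "fs", k = 5) +  # "fs" carries the per-site intercepts (point 2)
                    te(expertise, llcent) + fye + offset(log.n.periods) +
                    s(tmin) + ti(visit) + ti(year, k = 33) +
                    ti(visit, year),
                  data = df_train,
                  family = binomial(),
                  gamma = 1.4,
                  weights = neg_pos_weights,
                  select = TRUE,
                  discrete = TRUE,   # with discrete = TRUE, threading is controlled by nthreads (point 1)
                  nthreads = 4,      # set this to the number of physical cores available
                  # samfrac = 0.1,   # point 3, if needed: cheap initial fit on a subsample (see ?bam)
                  control = gam.control(trace = TRUE))

    ## With discrete = FALSE, a cluster made with parallel::makeCluster() can instead be
    ## supplied via the cluster argument to estimate the chunked model in parallel (see ?bam).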

I would start with 2, but spend time figuring out 1, especially if you are running your models on a modern machine with plenty of cores (the gain is smaller if your main machine is a laptop, as you'll want to fit on all the cores but then won't be able to use the machine while it is fitting). Do remember, though, to turn off hyperthreading or only use the number of physical cores on your system. 3 can help if you really need it, but the other options should help you more.
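
If it helps, the physical core count can be queried from R before setting nthreads; here is a minimal sketch using the parallel package (note that detectCores(logical = FALSE) is only honoured on some platforms):

    ## Query the physical core count so nthreads can be matched to it; on platforms
    ## where logical = FALSE is not honoured this may return the logical count or NA.
    library(parallel)
    n_phys <- detectCores(logical = FALSE)
    n_phys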

When Simon says the fs terms are efficient for gamm(), I think he means compared with their use in gam(). For modest data set sizes like yours and above, I've never had any success using gamm(). I just make sure that I have a lot of available RAM, use multithreaded computation to the extent allowed by gam() or bam(), and leave the models running over lunch or in the background while I'm doing something else.
