GAMs – Estimating Separate Smooths for a Factor with Hundreds of Levels Using Generalized Additive Models

biostatistics, generalized-additive-model, mgcv

I am fitting a smooth-factor interaction term to a longitudinal microbiome abundance data set using GAMs.

model <- gam(Abundance ~ s(Time, by = Genus) + 
             Batch + s(AnimalId, bs = "re"))

I am interested in estimating separate smooths for each level of the Genus factor. The factor includes hundreds of levels, and I am unsure whether this is the right approach for a sparse dataset.

There are specific genera that I am interested in based on previous exploratory analysis, but I noticed that when I ran separate models on subsets of the Genus factor, the effective degrees of freedom (EDF) tended to decrease as the number of levels increased.

My main questions are:

  • Why do the effective degrees of freedom decrease as a function of nlevels(Genus)?
  • What are the pros/cons of removing features with low sample prevalence or abundance?
  • Is this a reasonable approach for estimating separate smoothing functions for this type of dataset?

Best Answer

Q1

You probably don't want to use the by mechanism with this many factor levels, but any time you do use a factor by smooth, you must also include the factor itself in the model as a parametric term. I suspect what's happening is that, in order to model the different means, the smooths are changing shape and becoming less wiggly (using fewer EDF) simply to capture the differences in mean abundance between the genera.

Your model then should be:

model <- gam(Abundance ~ Genus +
               Batch +
               s(Time, by = Genus) +
               s(AnimalId, bs = "re"),
             family = something,
             method = "REML")

Q2

A big con is that you can no longer get estimates for the genera you left out. A pro is that it might not be possible to (easily) model those low-prevalence genera at all, or modelling them might require a much more complex model to capture the differences in abundance and the varying mean-variance relationships than if you modelled only the most abundant genera.
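If you do decide to filter, a minimal sketch of prevalence-based filtering in base R might look like the following. The data frame `df` and the 10% threshold here are hypothetical stand-ins; the toy data only exist to make the sketch runnable.

```r
# Toy stand-in for a long-format abundance data set (hypothetical structure)
set.seed(42)
df <- data.frame(
  Genus = factor(rep(c("g1", "g2", "g3"), each = 20)),
  Abundance = c(rpois(20, 5), rpois(20, 0.05), rpois(20, 3))
)

# Proportion of samples in which each genus is observed at all
prevalence <- tapply(df$Abundance > 0, df$Genus, mean)

# Keep genera present in at least, say, 10% of samples (threshold is arbitrary)
keep <- names(prevalence)[prevalence >= 0.10]

# Drop the filtered rows and the now-unused factor levels before fitting
df_filtered <- droplevels(df[df$Genus %in% keep, ])
```

Calling `droplevels()` matters here: if empty factor levels are left in `Genus`, a factor by smooth will still try to estimate a smooth for them.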

Whether either of these is a big deal depends on the questions you are trying to answer.

Q3

I would suggest that for something where you have tens or more levels of a factor, you might want to rethink the factor by approach. Instead, I would gravitate towards the fs basis type, which treats the collection of smooths more like true random effects to give a "random smooth": random intercepts and slopes for each genus plus random wiggly terms. The constraint that makes the fs basis feasible for data of this sort is that there is only one set of smoothing parameters to estimate, instead of a separate smoothing parameter for each genus. The implication is that we explicitly assume the smooth for each genus has the same overall wiggliness, but potentially a different shape. With the factor by smooth formulation, by contrast, you explicitly allow the smooth for each genus to have a different wiggliness.

As described above, a starting model using the fs basis would be:

model <- gam(Abundance ~ Batch +
               s(Time, Genus, bs = "fs") +
               s(AnimalId, bs = "re"),
             family = something,
             method = "REML")

noting that we have dropped the parametric term Genus, as s(Time, Genus, bs = "fs") includes a random intercept for each level of Genus.
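Once fitted, the per-genus smooths can be evaluated on a grid with predict(), excluding the animal-level random effect so the predictions reflect a "typical" animal rather than any one AnimalId. The data below are a small simulated stand-in (the variable names mirror the model above, but the structure is hypothetical), just to make the sketch self-contained:

```r
library(mgcv)
set.seed(1)

# Toy data standing in for the real abundance set (hypothetical structure)
df <- data.frame(
  Time     = rep(1:10, times = 6),
  Genus    = factor(rep(c("g1", "g2", "g3"), each = 20)),
  AnimalId = factor(rep(1:6, each = 10))
)
df$Abundance <- rpois(nrow(df), exp(1 + 0.1 * df$Time))

m <- gam(Abundance ~ s(Time, Genus, bs = "fs", k = 5) +
           s(AnimalId, bs = "re"),
         data = df, family = poisson(), method = "REML")

# Evaluate each genus's smooth on a fine Time grid; AnimalId must be
# supplied but its effect is excluded from the prediction
grid <- expand.grid(Time     = seq(1, 10, length.out = 50),
                    Genus    = levels(df$Genus),
                    AnimalId = levels(df$AnimalId)[1])
pred <- predict(m, newdata = grid, exclude = "s(AnimalId)", se.fit = TRUE)
```

The exclude argument of predict.gam() zeroes out the named smooth terms, which is the standard way to marginalise out random-effect smooths when plotting the population-level curves.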

If you have such large amounts of data, you should be looking at mgcv::gamm(), or better gamm4::gamm4(), to do the modelling, as the fs basis is designed to be more efficient in those functions; or, more likely still, mgcv::bam() if the model and data are large enough.
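As a sketch of the bam() route, the same model fits with the fast-REML criterion and covariate discretisation, which is where the big speed-ups come from. The simulated data below are a hypothetical stand-in (and omit Batch, which the toy data don't have):

```r
library(mgcv)
set.seed(2)

# Toy stand-in data; the real set would have hundreds of genera
n_genus <- 50
df <- data.frame(
  Time  = rep(seq(0, 1, length.out = 20), times = n_genus),
  Genus = factor(rep(paste0("g", seq_len(n_genus)), each = 20))
)
df$AnimalId  <- factor(sample(1:10, nrow(df), replace = TRUE))
df$Abundance <- rpois(nrow(df), exp(1 + sin(2 * pi * df$Time)))

# method = "fREML" and discrete = TRUE are the bam()-specific options
# that make large models with many fs levels tractable
fit <- bam(Abundance ~ s(Time, Genus, bs = "fs", k = 5) +
             s(AnimalId, bs = "re"),
           data = df, family = poisson(),
           method = "fREML", discrete = TRUE)
```

With discrete = TRUE, bam() bins the covariates and exploits the resulting structure in the model matrix, so fitting hundreds of genus-level smooths becomes feasible on ordinary hardware.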