GAMs – Estimating Separate Smooths for a Factor with Hundreds of Levels Using Generalized Additive Models

biostatistics, generalized-additive-model, mgcv

I am fitting a smooth-factor interaction term to a longitudinal microbiome abundance data set using GAMs.

model <- gam(Abundance ~ s(Time, by = Genus) + 
             Batch + s(AnimalId, bs = "re"))

I am interested in estimating separate smooths for each level of the Genus factor. The factor includes hundreds of levels, and I am unsure whether this is the right approach for a sparse dataset.

There are specific genera that I am interested in based on previous exploratory analysis, but I noticed that when I ran separate models on subsets of the Genus factor, the effective degrees of freedom (EDF) tended to decrease as the number of levels increased.

My main questions are:

  • Why do the effective degrees of freedom decrease as a function of nlevels(Genus)?
  • What are the pros/cons of removing features with low sample prevalence or abundance?
  • Is this a reasonable approach for estimating separate smoothing functions for this type of dataset?

Best Answer

Q1

You probably don't want to use the by mechanism with this many factor levels, but any time you do use a factor by smooth, you must also include the factor itself in the model as a parametric term. I suspect what's happening is that, in order to model the different means, the smooths are changing shape and becoming less wiggly (using fewer EDF) simply to capture the differences in mean abundance between the genera.

Your model then should be:

model <- gam(Abundance ~ Genus +
               Batch +
               s(Time, by = Genus) +
               s(AnimalId, bs = "re"),
             family = something,
             method = "REML")

Q2

A big con is that you can no longer get estimates for the genera you left out. A pro is that it might not be possible to (easily) model those low-prevalence genera at all, or modelling them might require a much more complex model to capture the differences in abundance and the varying mean-variance relationships than if you modelled only the most abundant genera.
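If you do decide to filter, a minimal sketch of prevalence-based filtering in base R might look like the following. The data frame `df` and the 10% threshold here are hypothetical stand-ins; the toy data only exist to make the sketch runnable.

```r
# Toy stand-in for a long-format abundance data set (hypothetical structure)
set.seed(42)
df <- data.frame(
  Genus = factor(rep(c("g1", "g2", "g3"), each = 20)),
  Abundance = c(rpois(20, 5), rpois(20, 0.05), rpois(20, 3))
)

# Proportion of samples in which each genus is observed at all
prevalence <- tapply(df$Abundance > 0, df$Genus, mean)

# Keep genera present in at least, say, 10% of samples (threshold is arbitrary)
keep <- names(prevalence)[prevalence >= 0.10]

# Drop the filtered rows and the now-unused factor levels before fitting
df_filtered <- droplevels(df[df$Genus %in% keep, ])
```

Calling `droplevels()` matters here: if empty factor levels are left in `Genus`, a factor by smooth will still try to estimate a smooth for them.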

Whether either of these is a big deal depends on the questions you are trying to answer.

Q3

I would suggest that for something where you have tens or more levels of a factor, you might want to rethink the factor by approach. Instead, I would gravitate towards the fs basis type, which treats the collection of smooths more like true random effects to give a "random smooth": random intercepts and slopes for each genus plus random wiggly terms. The constraint that makes the fs basis feasible for data of this sort is that there is only one set of smoothing parameters to estimate, instead of a separate smoothing parameter for each genus. The implication is that we explicitly assume the smooth for each genus has the same overall wiggliness, but potentially a different shape. With the factor by smooth formulation, by contrast, you explicitly allow the smooth for each genus to have a different wiggliness.

As described above, a starting model using the fs basis would be:

model <- gam(Abundance ~ Batch +
               s(Time, Genus, bs = "fs") +
               s(AnimalId, bs = "re"),
             family = something,
             method = "REML")

noting that we have dropped the parametric term Genus, as s(Time, Genus, bs = "fs") includes a random intercept for each level of Genus.
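Once fitted, the per-genus smooths can be evaluated on a grid with predict(), excluding the animal-level random effect so the predictions reflect a "typical" animal rather than any one AnimalId. The data below are a small simulated stand-in (the variable names mirror the model above, but the structure is hypothetical), just to make the sketch self-contained:

```r
library(mgcv)
set.seed(1)

# Toy data standing in for the real abundance set (hypothetical structure)
df <- data.frame(
  Time     = rep(1:10, times = 6),
  Genus    = factor(rep(c("g1", "g2", "g3"), each = 20)),
  AnimalId = factor(rep(1:6, each = 10))
)
df$Abundance <- rpois(nrow(df), exp(1 + 0.1 * df$Time))

m <- gam(Abundance ~ s(Time, Genus, bs = "fs", k = 5) +
           s(AnimalId, bs = "re"),
         data = df, family = poisson(), method = "REML")

# Evaluate each genus's smooth on a fine Time grid; AnimalId must be
# supplied but its effect is excluded from the prediction
grid <- expand.grid(Time     = seq(1, 10, length.out = 50),
                    Genus    = levels(df$Genus),
                    AnimalId = levels(df$AnimalId)[1])
pred <- predict(m, newdata = grid, exclude = "s(AnimalId)", se.fit = TRUE)
```

The exclude argument of predict.gam() zeroes out the named smooth terms, which is the standard way to marginalise out random-effect smooths when plotting the population-level curves.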

If you have such large amounts of data, you should be looking at mgcv::gamm(), or better gamm4::gamm4(), to do the modelling, as the fs basis is designed to be more efficient in those functions; or, more likely still, mgcv::bam() if the model and data are large enough.
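As a sketch of the bam() route, the same model fits with the fast-REML criterion and covariate discretisation, which is where the big speed-ups come from. The simulated data below are a hypothetical stand-in (and omit Batch, which the toy data don't have):

```r
library(mgcv)
set.seed(2)

# Toy stand-in data; the real set would have hundreds of genera
n_genus <- 50
df <- data.frame(
  Time  = rep(seq(0, 1, length.out = 20), times = n_genus),
  Genus = factor(rep(paste0("g", seq_len(n_genus)), each = 20))
)
df$AnimalId  <- factor(sample(1:10, nrow(df), replace = TRUE))
df$Abundance <- rpois(nrow(df), exp(1 + sin(2 * pi * df$Time)))

# method = "fREML" and discrete = TRUE are the bam()-specific options
# that make large models with many fs levels tractable
fit <- bam(Abundance ~ s(Time, Genus, bs = "fs", k = 5) +
             s(AnimalId, bs = "re"),
           data = df, family = poisson(),
           method = "fREML", discrete = TRUE)
```

With discrete = TRUE, bam() bins the covariates and exploits the resulting structure in the model matrix, so fitting hundreds of genus-level smooths becomes feasible on ordinary hardware.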