Solved – use bootstrapping to estimate the uncertainty in a maximum value of a GAM

Tags: bootstrap, generalized-additive-model, monte-carlo, uncertainty

I have data from an experiment where I look at the development of algal biomass as a function of the concentration of a nutrient. The relationship between biomass (the response variable) and the concentration (the explanatory variable) is more or less unimodal, with a clear "optimum" along the x-axis where the biomass peaks.

I have fitted a generalized additive model (GAM) to the data, and am interested in the concentration (the value on the x-axis) which gives the highest biomass (i.e., corresponds to the peak y-value). I have three replicates of biomass from each of 8 different concentrations of the nutrient (24 observations in total). In addition to just knowing at which concentration the GAM peaks, I would like to obtain some kind of uncertainty estimate of the position of the optimum. Is it a valid approach to use bootstrapping here?

I have tried the following procedure:

  1. For each of the eight nutrient concentrations, sample with replacement from the three replicates so that I get a new dataset with 24 observations, but where data from the same replicate may be used more than once.

  2. Fit a GAM to the data and estimate the x-value where the function peaks.

  3. Repeat a large number of times, save the estimate every time, and calculate the standard deviation (or the 2.5 and 97.5 percentiles).

Is this a valid approach or am I in trouble here?
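The three steps above can be sketched in R as follows. This is a hypothetical illustration, not code from the question: it assumes a data frame `dat` with columns `conc` (the 8 nutrient concentrations) and `biomass` (3 replicates per concentration), and it sets `k = 8` in `s()` because there are only 8 unique covariate values.

```r
library("mgcv")

nboot <- 999
opt   <- numeric(nboot)
grid  <- data.frame(conc = seq(min(dat$conc), max(dat$conc), length = 1000))

set.seed(42)
for (b in seq_len(nboot)) {
  ## step 1: resample the three replicates within each concentration
  boot <- do.call(rbind, lapply(split(dat, dat$conc), function(d)
    d[sample(nrow(d), replace = TRUE), ]))
  ## step 2: refit the GAM and locate the peak of the fitted curve
  fit    <- gam(biomass ~ s(conc, k = 8), data = boot, method = "REML")
  opt[b] <- grid$conc[which.max(predict(fit, newdata = grid))]
}
## step 3: percentile interval from the bootstrap distribution of the peak
quantile(opt, c(0.025, 0.975))
```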

Best Answer

An alternative approach, for GAMs fitted using Simon Wood's mgcv package for R, is to do posterior inference from the fitted GAM for the feature of interest. Essentially, this involves simulating from the posterior distribution of the parameters of the fitted model, predicting values of the response over a fine grid of $x$ locations, finding the $x$ at which the fitted curve takes its maximal value, repeating this for many simulated models, and computing a confidence interval for the location of the optimum as quantiles of the distribution of optima from the simulated models.

The meat of what I present below was cribbed from page 4 of Simon Wood's course notes (PDF).

To have something akin to a biomass example, I'm going to simulate a single species' abundance along a single gradient using my coenocliner package.

library("coenocliner")
A0    <- 9 * 10 # max abundance
m     <- 45     # location on gradient of modal abundance
r     <- 6 * 10 # species range of occurence on gradient
alpha <- 1.5    # shape parameter
gamma <- 0.5    # shape parameter
locs  <- 1:100  # gradient locations
pars  <- list(m = m, r = r, alpha = alpha,
              gamma = gamma, A0 = A0) # species parameters, in list form
set.seed(1)
mu <- coenocline(locs, responseModel = "beta", params = pars,
                 expectation = FALSE) ## simulated counts, not the expectation

Fit the GAM

library("mgcv")
m <- gam(mu ~ s(locs), method = "REML", family = "poisson") ## NB: reuses the name `m`

... predict on a fine grid over the range of $x$ (locs)...

p  <- data.frame(locs = seq(1, 100, length = 5000))
pp <- predict(m, newdata = p, type = "response")

and visualise the fitted function and the data

plot(mu ~ locs)
lines(pp ~ locs, data = p, col = "red")

This produces

[Figure: observed counts along the gradient with the fitted GAM curve overlaid in red]

The 5000 prediction locations are probably overkill here, and certainly for the plot, but depending on the fitted function in your use case, you might need a fine grid to get close to the maximum of the fitted curve.

Now we can simulate from the posterior of the model. First we get the $Xp$ matrix: the matrix that, once multiplied by the model coefficients, yields predictions from the model at the new locations p

Xp <- predict(m, p, type="lpmatrix") ## map coefs to fitted curves

Next we collect the fitted model coefficients and their (Bayesian) covariance matrix

beta <- coef(m)
Vb   <- vcov(m) ## posterior mean and cov of coefs

The coefficients follow a multivariate normal distribution with mean vector beta and covariance matrix Vb. Hence we can simulate from this multivariate normal new coefficient vectors for models consistent with the fitted one but which explore the uncertainty in the fit. Here we generate 10000 (n) simulated models

n <- 10000
library("MASS") ## for mvrnorm
set.seed(10)
mrand <- mvrnorm(n, beta, Vb) ## simulate n rep coef vectors from posterior

Now we can generate predictions for each of the n simulated models, transform from the scale of the linear predictor to the response scale by applying the inverse of the link function (ilink()), and then find the $x$ value (the value of p$locs) at the maximum of each simulated curve

opt <- rep(NA, n)
ilink <- family(m)$linkinv
for (i in seq_len(n)) { 
  pred   <- ilink(Xp %*% mrand[i, ])
  opt[i] <- p$locs[which.max(pred)]
}
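As an aside (not part of the original recipe), the loop can be replaced by a single matrix product, which is usually faster; note that the resulting 5000 × n matrix occupies a few hundred MB at n = 10000, so the loop may be preferable when memory is tight.

```r
## all n simulated curves at once: Xp is 5000 x k, t(mrand) is k x n
pred_all <- ilink(Xp %*% t(mrand))
opt_vec  <- p$locs[apply(pred_all, 2, which.max)]
```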

Now we compute the confidence interval for the optimum using probability quantiles of the distribution of 10,000 optima, one per simulated model

ci <- quantile(opt, c(.025,.975)) ## get 95% CI

For this example we have:

> ci
    2.5%    97.5% 
39.06321 52.39128

We can add this information to the earlier plot:

plot(mu ~ locs)
abline(v = p$locs[which.max(pp)], lty = "dashed", col = "grey")
lines(pp ~ locs, data = p, col = "red")
lines(y = rep(0,2), x = ci, col = "blue")
points(y = 0, x = p$locs[which.max(pp)], pch = 16, col = "blue")

which produces

[Figure: the previous plot with the estimated optimum (blue point, dashed grey line) and its 95% interval (blue line along the x-axis) added]

As we'd expect given the data, the interval on the fitted optimum is quite asymmetric.

Slide 5 of Simon's course notes suggests why this approach might be preferred to bootstrapping: posterior simulation is quick, whereas bootstrapping GAMs is slow. Two additional issues with bootstrapping (taken from Simon's notes) are:

  1. For parametric bootstrapping, smoothing bias causes problems: the model simulated from is biased, and the fits to the bootstrap samples will be yet more biased.
  2. For non-parametric ‘case-resampling’ the presence of replicate copies of the same data causes undersmoothing, especially with GCV based smoothness selection.

It should be noted that the posterior simulation performed here is conditional upon the smoothness parameters chosen for the model/spline. This can be accounted for, but Simon's notes suggest it makes little difference if you actually go to the trouble of doing it (so I haven't here).
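For completeness, mgcv can return a covariance matrix corrected for smoothing-parameter uncertainty when the model was fitted by ML or REML (as here). The sketch below, which is not part of the original answer, shows one way to fold this into the simulation:

```r
## covariance corrected for smoothing-parameter uncertainty (REML/ML fits only)
Vb2    <- vcov(m, unconditional = TRUE)
mrand2 <- MASS::mvrnorm(n, coef(m), Vb2)
## ...then repeat the which.max loop with mrand2 in place of mrand
```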
