Solved – use bootstrapping to estimate the uncertainty in a maximum value of a GAM

Tags: bootstrap, generalized-additive-model, monte-carlo, uncertainty

I have data from an experiment where I look at the development of algal biomass as a function of the concentration of a nutrient. The relationship between biomass (the response variable) and the concentration (the explanatory variable) is more or less unimodal, with a clear "optimum" along the x-axis where the biomass peaks.

I have fitted a generalized additive model (GAM) to the data, and am interested in the concentration (the value on the x-axis) which gives the highest biomass (i.e., corresponds to the peak y-value). I have three replicates of biomass from each of 8 different concentrations of the nutrient (24 observations in total). In addition to just knowing at which concentration the GAM peaks, I would like to obtain some kind of uncertainty estimate of the position of the optimum. Is it a valid approach to use bootstrapping here?

I have tried the following procedure:

  1. For each of the eight nutrient concentrations, sample with replacement from the three replicates so that I get a new dataset with 24 observations, but where data from the same replicate may be used more than once.

  2. Fit a GAM to the data and estimate the x-value where the function peaks.

  3. Repeat a large number of times, save the estimate every time, and calculate the standard deviation (or the 2.5 and 97.5 percentiles).

Is this a valid approach or am I in trouble here?
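The three steps above can be sketched in R as follows. This is a hypothetical illustration, not code from the question: it assumes a data frame `dat` with columns `conc` (the 8 nutrient concentrations) and `biomass` (3 replicates per concentration), and it sets `k = 8` in `s()` because there are only 8 unique covariate values.

```r
library("mgcv")

nboot <- 999
opt   <- numeric(nboot)
grid  <- data.frame(conc = seq(min(dat$conc), max(dat$conc), length = 1000))

set.seed(42)
for (b in seq_len(nboot)) {
  ## step 1: resample the three replicates within each concentration
  boot <- do.call(rbind, lapply(split(dat, dat$conc), function(d)
    d[sample(nrow(d), replace = TRUE), ]))
  ## step 2: refit the GAM and locate the peak of the fitted curve
  fit    <- gam(biomass ~ s(conc, k = 8), data = boot, method = "REML")
  opt[b] <- grid$conc[which.max(predict(fit, newdata = grid))]
}
## step 3: percentile interval from the bootstrap distribution of the peak
quantile(opt, c(0.025, 0.975))
```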

Best Answer

An alternative approach, for GAMs fitted using Simon Wood's mgcv package for R, is to do posterior inference from the fitted GAM for the feature of interest. Essentially, this involves simulating from the posterior distribution of the parameters of the fitted model, predicting values of the response over a fine grid of $x$ locations, finding the $x$ at which the fitted curve takes its maximal value, repeating this for many simulated models, and computing a confidence interval for the location of the optimum as quantiles of the distribution of optima from the simulated models.

The meat of what I present below was cribbed from page 4 of Simon Wood's course notes (PDF).

To have something akin to a biomass example, I'm going to simulate a single species' abundance along a single gradient using my coenocliner package.

library("coenocliner")
A0    <- 9 * 10 # max abundance
m     <- 45     # location on gradient of modal abundance
r     <- 6 * 10 # species range of occurence on gradient
alpha <- 1.5    # shape parameter
gamma <- 0.5    # shape parameter
locs  <- 1:100  # gradient locations
pars  <- list(m = m, r = r, alpha = alpha,
              gamma = gamma, A0 = A0) # species parameters, in list form
set.seed(1)
mu <- coenocline(locs, responseModel = "beta", params = pars,
                 expectation = FALSE) ## simulated counts, not the expectation

Fit the GAM

library("mgcv")
m <- gam(mu ~ s(locs), method = "REML", family = "poisson") ## NB: reuses the name `m`

... predict on a fine grid over the range of $x$ (locs)...

p  <- data.frame(locs = seq(1, 100, length = 5000))
pp <- predict(m, newdata = p, type = "response")

and visualise the fitted function and the data

plot(mu ~ locs)
lines(pp ~ locs, data = p, col = "red")

This produces

[Figure: observed counts along the gradient with the fitted GAM curve overlaid in red]

The 5000 prediction locations are probably overkill here, and certainly for the plot, but depending on the fitted function in your use case, you might need a fine grid to get close to the maximum of the fitted curve.

Now we can simulate from the posterior of the model. First we get the $Xp$ matrix: the matrix that, once multiplied by the model coefficients, yields predictions from the model at the new locations p

Xp <- predict(m, p, type="lpmatrix") ## map coefs to fitted curves

Next we collect the fitted model coefficients and their (Bayesian) covariance matrix

beta <- coef(m)
Vb   <- vcov(m) ## posterior mean and cov of coefs

The coefficients follow a multivariate normal distribution with mean vector beta and covariance matrix Vb. Hence we can simulate from this multivariate normal new coefficient vectors for models consistent with the fitted one but which explore the uncertainty in the fit. Here we generate 10000 (n) simulated models

n <- 10000
library("MASS") ## for mvrnorm
set.seed(10)
mrand <- mvrnorm(n, beta, Vb) ## simulate n rep coef vectors from posterior

Now we can generate predictions for each of the n simulated models, transform from the scale of the linear predictor to the response scale by applying the inverse of the link function (ilink()), and then find the $x$ value (the value of p$locs) at the maximum of each simulated curve

opt <- rep(NA, n)
ilink <- family(m)$linkinv
for (i in seq_len(n)) { 
  pred   <- ilink(Xp %*% mrand[i, ])
  opt[i] <- p$locs[which.max(pred)]
}
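As an aside (not part of the original recipe), the loop can be replaced by a single matrix product, which is usually faster; note that the resulting 5000 × n matrix occupies a few hundred MB at n = 10000, so the loop may be preferable when memory is tight.

```r
## all n simulated curves at once: Xp is 5000 x k, t(mrand) is k x n
pred_all <- ilink(Xp %*% t(mrand))
opt_vec  <- p$locs[apply(pred_all, 2, which.max)]
```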

Now we compute the confidence interval for the optimum using probability quantiles of the distribution of 10,000 optima, one per simulated model

ci <- quantile(opt, c(.025,.975)) ## get 95% CI

For this example we have:

> ci
    2.5%    97.5% 
39.06321 52.39128

We can add this information to the earlier plot:

plot(mu ~ locs)
abline(v = p$locs[which.max(pp)], lty = "dashed", col = "grey")
lines(pp ~ locs, data = p, col = "red")
lines(y = rep(0,2), x = ci, col = "blue")
points(y = 0, x = p$locs[which.max(pp)], pch = 16, col = "blue")

which produces

[Figure: the previous plot with the estimated optimum (blue point, dashed grey line) and its 95% interval (blue line along the x-axis) added]

As we'd expect given the data, the interval on the fitted optimum is quite asymmetric.

Slide 5 of Simon's course notes suggests why this approach might be preferred to bootstrapping: posterior simulation is quick, whereas bootstrapping GAMs is slow. Two additional issues with bootstrapping (taken from Simon's notes) are:

  1. For parametric bootstrapping, smoothing bias causes problems: the model simulated from is biased, and the fits to the bootstrap samples will be yet more biased.
  2. For non-parametric ‘case-resampling’ the presence of replicate copies of the same data causes undersmoothing, especially with GCV based smoothness selection.

It should be noted that the posterior simulation performed here is conditional upon the smoothness parameters chosen for the model/spline. This can be accounted for, but Simon's notes suggest it makes little difference if you actually go to the trouble of doing it (so I haven't here).
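For completeness, mgcv can return a covariance matrix corrected for smoothing-parameter uncertainty when the model was fitted by ML or REML (as here). The sketch below, which is not part of the original answer, shows one way to fold this into the simulation:

```r
## covariance corrected for smoothing-parameter uncertainty (REML/ML fits only)
Vb2    <- vcov(m, unconditional = TRUE)
mrand2 <- MASS::mvrnorm(n, coef(m), Vb2)
## ...then repeat the which.max loop with mrand2 in place of mrand
```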
