Solved – Can you extrapolate values of the dependent variable with a GAM

Tags: extrapolation, generalized-linear-model, generalized-additive-model, method-comparison

I'm trying to find cases where GLMs are better than GAMs, and it occurred to me that GLMs can make predictions beyond the range of the data used to fit the model (i.e., extrapolations), while GAMs cannot:

Suppose we have a set of X and Y observations. The X observations are spread over the domain [x0, x1]. If we fit a GLM of Y on X, we obtain a mathematical relation between X and Y (in the simplest case, Y = b0*X + b1). Therefore, we can obtain a modelled Y_i for every X_i of our choice. We should have a good estimate if X_i is inside [x0, x1], but nothing stops us from also trying values outside this range (whether the estimate is "good" is another story).

Now, GAMs are based on smooth functions estimated from the X-Y scatter, but they give no (simple) mathematical relation between X and Y. You get a Y estimate for each X observation you have and can make a nice plot. Surely you can interpolate a Y value between observations to obtain an estimate of your choice, but since we only have X data inside the range [x0, x1], you cannot predict (or extrapolate) a Y value with a GAM for an X value lying outside that range. With no mathematical relation linking X and Y, you cannot extrapolate!

So, if I understand correctly and the answer to my question is "no", I would say that the extrapolation (predictive) potential of a GLM is a very strong advantage over a GAM!

Best Answer

Four years have passed and I am now able to answer my own question. The answer is indeed no: you cannot really extrapolate with a GAM, except in a very limited range quite close to x0 or x1. If you are using splines, you would be extrapolating with a cubic polynomial, which is not very good, since the curve quickly tends to -infinity or +infinity. There is an example about this in Simon Wood's book "Generalized Additive Models: An Introduction with R" (exercise 5 on page 400), where it is shown that the extrapolation capacity of a GAM is very limited. I do believe that GLMs are better suited to extrapolating data.
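As a rough illustration of the point above, here is a minimal sketch that fits a GLM and an mgcv GAM to the same data on [0, 1] and asks both for predictions inside and outside that range. The data, the object names (dat, fit_glm, fit_gam, new_x) and the prediction points are assumptions made up for this example. Mechanically, predict() returns a value for any x; the thing to look at is how the fitted values and standard errors behave once x leaves the observed range.

```r
library(mgcv)

set.seed(42)
# Hypothetical data: a linear trend plus noise, observed only on [0, 1]
dat <- data.frame(x = runif(200))
dat$y <- 1 + 2 * dat$x + rnorm(200, sd = 0.3)

fit_glm <- glm(y ~ x, data = dat)                      # Gaussian GLM
fit_gam <- gam(y ~ s(x), data = dat, method = "REML")  # GAM with one smooth

# One point inside the observed range, two outside it
new_x <- data.frame(x = c(0.5, 1.5, 3))

pred_glm <- predict(fit_glm, newdata = new_x, se.fit = TRUE)
pred_gam <- predict(fit_gam, newdata = new_x, se.fit = TRUE)

# Side-by-side comparison of fitted values and standard errors
res <- data.frame(
  x       = new_x$x,
  glm_fit = as.numeric(pred_glm$fit),
  glm_se  = as.numeric(pred_glm$se.fit),
  gam_fit = as.numeric(pred_gam$fit),
  gam_se  = as.numeric(pred_gam$se.fit)
)
print(res)
```

Comparing the glm_se and gam_se columns for the points outside [0, 1] gives a feel for how much each model claims to know there. It says nothing about whether the assumed functional form actually holds outside the data.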

Related Solutions

Solved – Are LOESS and GAM with one covariate the same

Not really a full answer, but too long for a comment: s sets up a spline, whereas loess does a local regression.

In the gam package (maybe mgcv too, not too familiar with that one) you can also feed a local regression, as in

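library(gam)

set.seed(1234)

# generate data
x <- sort(runif(100))
y <- sin(2*pi*x) + rnorm(100, sd=0.1)  # one noise draw per observation

gam.1 <- gam(y ~ lo(x))   # local regression term inside a GAM
base.r <- loess(y ~ x)    # base R local regression
# compare the two sets of fitted values
summary(fitted(base.r) - fitted(gam.1))
plot(fitted(base.r), fitted(gam.1))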

That does not produce the same fitted values either, but maybe you can further play around with the settings of lo and loess.
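One setting worth aligning first is the smoother's span and degree. As far as I know, lo() defaults to span = 0.5 and degree = 1, while loess() defaults to span = 0.75 and degree = 2, so the two calls above are not fitting the same local regression. The sketch below passes both arguments explicitly; the object names gam.2 and base.2 are made up for this example, and whether the remaining differences vanish is something to check rather than assume.

```r
library(gam)

set.seed(1234)
x <- sort(runif(100))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.1)

# Pass the same span and degree to both smoothers explicitly
gam.2  <- gam(y ~ lo(x, span = 0.5, degree = 1))
base.2 <- loess(y ~ x, span = 0.5, degree = 1)

# Differences between the two sets of fitted values
d <- fitted(base.2) - fitted(gam.2)
summary(d)
```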

Solved – use bootstrapping to estimate the uncertainty in a maximum value of a GAM

An alternative approach that can be used for GAMs fitted using Simon Wood's mgcv software for R is to do posterior inference from the fitted GAM for the feature of interest. Essentially, this involves simulating from the posterior distribution of the parameters of the fitted model, predicting values of the response over a fine grid of $x$ locations, finding the $x$ where the fitted curve takes its maximal value, repeating this for lots of simulated models, and computing a confidence interval for the location of the optimum from the quantiles of the distribution of optima from the simulated models.

The meat of what I present below was cribbed from page 4 of Simon Wood's course notes (pdf).

To have something akin to a biomass example, I'm going to simulate a single species' abundance along a single gradient using my coenocliner package.

library("coenocliner")
A0    <- 9 * 10 # max abundance
m     <- 45     # location on gradient of modal abundance
r     <- 6 * 10 # species range of occurrence on gradient
alpha <- 1.5    # shape parameter
gamma <- 0.5    # shape parameter
locs  <- 1:100  # gradient locations
pars  <- list(m = m, r = r, alpha = alpha,
              gamma = gamma, A0 = A0) # species parameters, in list form
set.seed(1)
mu <- coenocline(locs, responseModel = "beta", params = pars, expectation = FALSE)

Fit the GAM

library("mgcv")
m <- gam(mu ~ s(locs), method = "REML", family = "poisson")

... predict on a fine grid over the range of $x$ (locs)...

p  <- data.frame(locs = seq(1, 100, length = 5000))
pp <- predict(m, newdata = p, type = "response")

and visualise the fitted function and the data

plot(mu ~ locs)
lines(pp ~ locs, data = p, col = "red")

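This produces a scatterplot of the simulated abundances against gradient location, with the fitted GAM curve overlaid in red.

Using 5000 prediction locations is probably overkill here, and certainly for the plot, but depending on the fitted function in your use-case, you might need a fine grid to get close to the maximum of the fitted curve.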

Now we can simulate from the posterior of the model. First we get the $Xp$ matrix: the matrix that, once multiplied by the model coefficients, yields predictions from the model at the new locations p.

Xp <- predict(m, p, type="lpmatrix") ## map coefs to fitted curves
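To see what the lpmatrix does, the following minimal sketch checks that multiplying it by the fitted coefficients reproduces the linear-predictor values returned by predict(). It uses hypothetical simulated Poisson counts rather than the coenocliner data, and the names dat, fit, newd and Xp_demo are assumptions for this example only.

```r
library(mgcv)

set.seed(1)
# Hypothetical Poisson counts with a single peak along a gradient
dat <- data.frame(x = 1:100)
dat$y <- rpois(100, lambda = 50 * exp(-((dat$x - 45) / 20)^2))

fit  <- gam(y ~ s(x), data = dat, family = poisson, method = "REML")
newd <- data.frame(x = seq(1, 100, length.out = 200))

# Linear predictor matrix: one row per new location, one column per coefficient
Xp_demo <- predict(fit, newdata = newd, type = "lpmatrix")

# Predictions on the link scale, computed two ways
eta_mat <- as.numeric(Xp_demo %*% coef(fit))
eta_ref <- as.numeric(predict(fit, newdata = newd, type = "link"))

all.equal(eta_mat, eta_ref)
```

Because the mapping from coefficients to the linear predictor is a single matrix product, swapping coef(fit) for a simulated coefficient vector gives the curve implied by that simulated model.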

Next we collect the fitted model coefficients and their (Bayesian) covariance matrix

beta <- coef(m)
Vb   <- vcov(m) ## posterior mean and cov of coefs

The coefficients are distributed multivariate normal with mean vector beta and covariance matrix Vb. Hence we can simulate new coefficients from this multivariate normal, giving models that are consistent with the fitted one but which explore the uncertainty in the fitted model. Here we generate 10000 (n) simulated models

n <- 10000
library("MASS") ## for mvrnorm
set.seed(10)
mrand <- mvrnorm(n, beta, Vb) ## simulate n rep coef vectors from posterior

Now we can generate predictions for each of the n simulated models, transform them from the scale of the linear predictor to the response scale by applying the inverse of the link function (ilink()), and then compute the $x$ value (value of p$locs) at the maximal point of the fitted curve

opt <- rep(NA, n)
ilink <- family(m)$linkinv
for (i in seq_len(n)) { 
  pred   <- ilink(Xp %*% mrand[i, ])
  opt[i] <- p$locs[which.max(pred)]
}

Now we compute the confidence interval for the optima using probability quantiles of the distribution of 10,000 optima, one per simulated model

ci <- quantile(opt, c(.025,.975)) ## get 95% CI

For this example we have:

> ci
    2.5%    97.5% 
39.06321 52.39128

We can add this information to the earlier plot:

plot(mu ~ locs)
abline(v = p$locs[which.max(pp)], lty = "dashed", col = "grey")
lines(pp ~ locs, data = p, col = "red")
lines(y = rep(0,2), x = ci, col = "blue")
points(y = 0, x = p$locs[which.max(pp)], pch = 16, col = "blue")

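which produces the earlier plot with a dashed grey vertical line at the fitted optimum, a blue point marking the optimum on the x-axis, and a blue segment showing the 95% interval.

As we'd expect given the data/observations, the interval on the fitted optimum is quite asymmetric.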

Slide 5 of Simon's course notes suggests why this approach might be preferred to bootstrapping. One advantage of posterior simulation is that it is quick, whereas bootstrapping GAMs is slow. Two additional issues with bootstrapping are (taken from Simon's notes!)

For parametric bootstrapping the smoothing bias causes problems, the model simulated from is biased and the fits to the samples will be yet more biased.
For non-parametric ‘case-resampling’ the presence of replicate copies of the same data causes undersmoothing, especially with GCV based smoothness selection.
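For comparison, a non-parametric case-resampling bootstrap of the location of the optimum could look roughly like the sketch below. It is only a sketch under assumptions: the data are hypothetical simulated Poisson counts, the names (dat, grid, B, opt_boot) are made up for this example, and the number of replicates is kept very small because every replicate refits the GAM. The undersmoothing caveat quoted above applies to it.

```r
library(mgcv)

set.seed(2)
# Hypothetical Poisson counts with a single peak along a gradient
dat <- data.frame(x = 1:100)
dat$y <- rpois(100, lambda = 50 * exp(-((dat$x - 45) / 20)^2))

# Grid over which each refitted curve is searched for its maximum
grid <- data.frame(x = seq(1, 100, length.out = 500))

B <- 50                     # far too few for a stable interval; illustration only
opt_boot <- numeric(B)
for (b in seq_len(B)) {
  idx    <- sample(nrow(dat), replace = TRUE)   # resample rows with replacement
  fit_b  <- gam(y ~ s(x), data = dat[idx, ], family = poisson, method = "REML")
  pred_b <- predict(fit_b, newdata = grid, type = "response")
  opt_boot[b] <- grid$x[which.max(pred_b)]      # location of this replicate's maximum
}

# Percentile interval for the location of the optimum
quantile(opt_boot, c(0.025, 0.975))
```

Each replicate costs a full model fit, which is the speed argument made above: the posterior simulation needs one fit plus cheap matrix products.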

It should be noted that the posterior simulation performed here is conditional upon the chosen smoothness parameters for the model/spline. This can be accounted for, but Simon's notes suggest this makes little difference if you actually go to the trouble of doing it (so I haven't here...).
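One way to account for smoothing parameter uncertainty, as I understand the mgcv documentation, is the unconditional argument of vcov() for gam objects, which returns a covariance matrix corrected for that uncertainty when the model was fitted with REML or ML. The sketch below compares the two matrices on hypothetical simulated data; dat, fit, Vb_cond and Vb_uncond are names assumed for this example.

```r
library(mgcv)

set.seed(3)
# Hypothetical Poisson counts with a single peak along a gradient
dat <- data.frame(x = 1:100)
dat$y <- rpois(100, lambda = 50 * exp(-((dat$x - 45) / 20)^2))

fit <- gam(y ~ s(x), data = dat, family = poisson, method = "REML")

Vb_cond   <- vcov(fit)                        # conditional on the smoothing parameters
Vb_uncond <- vcov(fit, unconditional = TRUE)  # corrected for smoothing parameter uncertainty

# Compare the implied standard errors of the coefficients
se_compare <- cbind(conditional   = sqrt(diag(Vb_cond)),
                    unconditional = sqrt(diag(Vb_uncond)))
print(round(se_compare, 4))
```

The corrected matrix could then be used in place of Vb in the mvrnorm() call, leaving the rest of the simulation unchanged.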
