Solved – Simulating Responses from fitted Generalized Additive Model

generalized-additive-modelmgcvsimulation

I have fit a GAM using the mgcv package in R. My goal is to simulate the posterior distribution of the response variable for new subjects (say, 1000 draws each). I understand how to use the predict function to predict expected value on the response scale, but I am not familiar with how to model the variation in expected response. My goal is to model the distribution of possible responses. If my question is not clear, I'm trying to follow the "simulation" half of the answer here

Best Answer

I'll illustrate with the classic 4 term data set oft used to illustrate GAMs, but will only simulate data from the strongly nonlinear term $f(x_2)$ as it is easy to visualise the process with a single covariate.

library('mgcv')
set.seed(20)

f2 <- function(x) 0.2 * x^11 * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^10
ysim <- function(n = 500, scale = 2) {
  x <- runif(n)
  e <- rnorm(n, 0, scale)
  f <- f2(x)
  y <- f + e
  data.frame(y = y, x2 = x, f2 = f)
}

df <- ysim()
head(df)

Fit the model:

m <- gam(y ~ s(x2), data = df)

Now, simulate 50 new data points from the posterior distribution of the response conditional upon the estimated spline coefficients and smoothness parameters.

set.seed(10)
nsim <- 50
drawx <- function(x, n) runif(n, min = min(x), max = max(x))
newx <- data.frame(x2 = drawx(df[, 'x2'], n = nsim))

Here I'm drawing 50 values uniformly at random over the range of $x_2$.

The simulated values are given by

$$y \sim \mathcal{N}(\hat{y}, \hat{\sigma}^2) $$

where $\hat{y}$ is

$$\hat{y} = \alpha + s(x_2)$$

and $\hat{\sigma}$ is the residual standard error, which is generally estimated from the residuals of a Gaussian model rather than modelled directly (although the latter is certainly possible). We get $\hat{y}$ via predict().

First grab the estimated residual variance from the model

sig2 <- m$sig2

Next predict for the 50 new locations

set.seed(10)
newx <- transform(newx, newy = predict(m, newx, type = "response"))

and then, using these values of $\hat{y}$ in newy, simulate 50 new values from a Gaussian with mean given by newy and standard deviation (sqrt of sig2)

newx <- transform(newx, ysim = rnorm(nsim, mean = newy, sd = sqrt(sig2)))

Notice that each simulated value is a random draw from a Gaussian distribution with mean value equal to the model estimated value and variance $\hat{\sigma}^2$.

We should check what we've done. First, generate more values from the fitted spline so we can see what model was estimated

pred <- with(df, data.frame(x2 = seq(min(x2), max(x2), length = 500)))
pred <- transform(pred, fitted = predict(m, newdata = pred,
                                         type = "response"))

Now we can put this all together

library('ggplot2')
theme_set(theme_bw())

ggplot(df) +
  geom_point(aes(x = x2, y = y), colour = "grey") +
  geom_line(aes(x = x2, y = f2), colour = "forestgreen", size = 1.3) +
  geom_line(aes(x = x2, y = fitted), data = pred, colour = "blue") +
  geom_point(aes(x = x2, y = newy), data = newx, colour = "blue", size = 2) +
  geom_point(aes(x = x2, y = ysim), data = newx, colour = "red",
             size = 2)

enter image description here

The thick green line is the true function $f(x_2)$. The grey dots are the original data to which the GAM was fitted. The blue line is the fitted smooth, the estimate $\hat{f}(x_2)$. The blue points are the fitted values for the 50 locations at which we simulated from the model. The red points are the random draws from the posterior distribution of $y$ conditional upon the estimated model.

This approach extends to multiple covariates — it's just not so easy to visualize what's going on. This also extends to non-Gaussian responses, such as Poisson or binomial-distributed response.

The key points are:

  1. Predict from the model using predict(), to get the expectation of the response

    • (this is the mean in the case of the Gaussian, but is the expected count say for the Poisson)
  2. Predict or extract values for any additional parameters needed for the conditional distribution of the response

    • (for the Gaussian, as shown here, there is a second parameter $\sigma^2$, which is needed to fully specify the conditional distribution of the response. For the Poisson, for example, there is only the "mean", the expected count.)
  3. Simulate from the conditional distribution of the response

    • (here we could use rnorm() directly because the parameters of the distribution estimated via the GLM/GAM were in exactly the same form as that expected by rnorm(). This is also the case for the Poisson and Binomial distributions. For other distributions this may not be case. For example the parameters of a GLM/GAM with a Gamma family fitted via glm() and gam() require some translation before you can plug them into the rgamma() function.
Related Question