Solved – Creating marginal plots when using mgcv (GAM) package in R

generalized-additive-modelmgcvr

Before training the GAM model I log-transformed several variables. The default plotting method plot.gam in mgcv can provide nice visualization of the marginal effect of a variable but seems not be able to transform the variable back to the original scale. Is there any easy way to do this?

If not, how can we make the marginal plots by hand? I seem to have figured out how to do this with one predictor. However, in the multiple predictor case, if we scan the variable of interest, what fixed values should we use for the rest of variables? Shall we take the mean for all other variables? What if the rest of variables have some categorical variables where the mean is not clear?

Best Answer

First, make sure that transforming your predictor (or response) variables is actually a reasonable thing to do (see, e.g., here).

Second, you'll have to do it manually if you mean back-transforming a predictor variable to an original scale. If you're talking about the response variable, read this again, and if you still decided to transform the response variable see the trans and possibly shift arguments in ?plot.gam.

But let's assume you're referring to back-transforming a predictor variable. For a marginal plot, setting the other continuous variables at their mean is a popular choice, but that need not be the case. If a predictor variable was highly skewed, for example, you could easily use the median as an option. Or a specific value relevant to the context. You'll have to think carefully about what makes the most sense in your application. Note also that using the mean of a log-transformed variable in your marginal plots puts it at the geometric mean on the original scale.

The same thought process applies to categorical variables --- you could produce the marginal plot for the reference class (as I illustrate below), another class that may be more relevant, or even all classes.

Below we use mgcv's simulation capabilities to generate a data set with 3 continuous predictors (x1, x2, and x3) and a categorical predictor (x0). For simplicity, imagine that x1 and x2 have already been log-transformed, so we're interested in returning them to their original scale in our marginal plots. Since we (pretended to) log-transform x1 and x2, it's a simple matter of exponentiating the values prior to plotting.

library(mgcv)
set.seed(123)

# Assume normal response
# Assume x1 and x2 have been log-transformed
xformed_data <- gamSim(5) 
mod <- gam(y ~ x0 + s(x1) + s(x2) + s(x3), data = xformed_data) 
plot(mod, pages = 1) 

# Marginal plots (transformed and original scale) for x1
# Generate new data set with categorical variable at reference class (i.e., x0 = "1")
# This marginal plot will illustrate the marginal relationship between x1 and y with
# x0 at the refence level, x2 at its geometric mean, and x3 at its mean

x1dat <- data.frame(x0 = factor("1"), 
                    x1 = seq(min(xformed_data$x1), max(xformed_data$x1), length.out = 100),
                    x2 = mean(xformed_data$x2), 
                    x3 = mean(xformed_data$x3))

x1fit <- predict(mod, newdata = x1dat, type = "response", se = TRUE) 

# Marginal plot for x1 on transformed scale; will match output from plot(mod)
# although confidence intervals will not match because uncertainty from other
# smooths and intercept now included
plot(y ~ x1, xformed_data) 
lines(x1dat$x1, x1fit$fit) 
lines(x1dat$x1, x1fit$fit +2 * x1fit$se.fit, lty=2, col = "red") 
lines(x1dat$x1, x1fit$fit -2 * x1fit$se.fit, lty=2, col = "red") 

# Marginal plot for x1 on original scale; will match output from plot(mod)
plot(y ~ exp(x1), xformed_data) 
lines(exp(x1dat$x1), x1fit$fit) 
lines(exp(x1dat$x1), x1fit$fit +2 * x1fit$se.fit, lty=2, col = "red") 
lines(exp(x1dat$x1), x1fit$fit -2 * x1fit$se.fit, lty=2, col = "red") 

# Marginal plots (transformed and original scale) for x2
# Generate new data set with categorical variable at reference class (i.e., x0 = "1")
# This marginal plot will illustrate the marginal relationship between x2 and y with
# x0 at the refence level, x1 at its geometric mean, and x3 at its mean

x2dat <- data.frame(x0 = factor("1"), 
                    x1 = mean(xformed_data$x1),
                    x2 = seq(min(xformed_data$x2), max(xformed_data$x2), length.out = 100), 
                    x3 = mean(xformed_data$x3))

x2fit <- predict(mod, newdata = x2dat, type = "response", se = TRUE) 

# Marginal plot for x2 on transformed scale; will match output from plot(mod)
# although confidence intervals will not match because uncertainty from other
# smooths and intercept now included
plot(y ~ x2, xformed_data) 
lines(x2dat$x2, x2fit$fit) 
lines(x2dat$x2, x2fit$fit +2 * x2fit$se.fit, lty=2, col = "red") 
lines(x2dat$x2, x2fit$fit -2 * x2fit$se.fit, lty=2, col = "red") 

# Marginal plot for x2 on original scale; will match output from plot(mod)
plot(y ~ exp(x2), xformed_data) 
lines(exp(x2dat$x2), x2fit$fit) 
lines(exp(x2dat$x2), x2fit$fit +2 * x2fit$se.fit, lty=2, col = "red") 
lines(exp(x2dat$x2), x2fit$fit -2 * x2fit$se.fit, lty=2, col = "red")

Related Solutions

Solved – How to tune smoothing in mgcv GAM model

The k argument effectively sets up the dimensionality of the smoothing matrix for each term. gam() is using a GCV or UBRE score to select an optimal amount of smoothness, but it can only work within the dimensionality of the smoothing matrix. By default, te() smooths have k = 5^2 for 2d surfaces. I forget what it is for s() so check the documents. The current advice from Simon Wood, author of mgcv, is that if the degree of smoothness selected by the model is at or close to the limit of the dimensionality imposed by the value used for k, you should increase k and refit the model to see if a more complex model is selected from the higher dimensional smoothing matrix.

However, I don't know how locfit works, but you do need to have something the stops you from fitting too complex a surface (GCV and UBRE, or (RE)ML if you choose to use them [you can't as you set scale = -1], are trying to do just that), that is not supported by the data. In other words, you could fit very local features of the data but are you fitting the noise in the sample of data you collected or are you fitting the mean of the probability distribution? gam() may be telling you something about what can be estimated from your data, assuming that you've sorted out the basis dimensionality (above).

Another thing to look at is that the smoothers you are currently using are global in the sense that the smoothness selected is applied over the entire range of the smooth. Adaptive smoothers can spend the allotted smoothness "allowance" in parts of the data where the response is changing rapidly. gam() has capabilities for using adaptive smoothers.

See ?smooth.terms and ?adaptive.smooth to see what can be fitted using gam(). te() can combine most if not all of these smoothers (check the docs for which can and can't be included in tensor products) so you could use an adaptive smoother basis to try to capture the finer local scale in the parts of the data where the response is varying quickly.

I should add, that you can get R to estimate a model with a fixed set of degrees of freedom used by a smooth term, using the fx = TRUE argument to s() and te(). Basically, set k to be what you want and fx = TRUE and gam() will just fit a regression spline of fixed degrees of freedom not a penalised regression spline.

Solved – Calculating a risk ratio for specific x values from a GAM model using the mgcv package

This doesn't exactly answer your question, but it might still solve your problem of needing to calculate risk ratios. The epiR package allows you to calculate risk ratios.

I could not get your example to work (see my comment to your question), so here is an example from the package's documentation:

library(epiR) # Used for Risk ratio
library(MASS) # Used for data

dat1 <- birthwt; head(dat1)

## Generate a table of cell frequencies. First set the levels of the outcome
## and the exposure so the frequencies in the 2 by 2 table come out in the
## conventional format:
dat1$low <- factor(dat1$low, levels = c(1,0))
dat1$smoke <- factor(dat1$smoke, levels = c(1,0))
dat1$race <- factor(dat1$race, levels = c(1,2,3))
## Generate the 2 by 2 table. Exposure (rows) = smoke. Outcome (columns) = low.
tab1 <- table(dat1$smoke, dat1$low, dnn = c("Smoke", "Low BW"))
print(tab1)
## Compute the incidence risk ratio and other measures of association:
epi.2by2(dat = tab1, method = "cohort.count", 
conf.level = 0.95, units = 100, outcome = "as.columns")

Best Answer

Related Solutions

Solved – How to tune smoothing in mgcv GAM model

Solved – Calculating a risk ratio for specific x values from a GAM model using the mgcv package

Related Question