GAM Predictions – How to Compare Effects of Predictors Between GAMs

effect-sizegeneralized-additive-modelmgcv

I have two models, which share some predictors. I'd like to compare the magnitude of their effects on the respective response variable.

Here's an example based on data and code taken from here.

Data:

library(mgcv)

forest <- read.table(url("https://raw.githubusercontent.com/eric-pedersen/mgcv-esa-workshop/master/data/forest-health/beech.raw"),
                     header = TRUE)

forest <- transform(forest, id = factor(formatC(id, width = 2, flag = "0")))

## Aggregate defoliation & convert categorical vars to factors
levs <- c("low","med","high")
forest <- transform(forest,
                    aggDefol = as.numeric(cut(defol, breaks = c(-1,10,45,101),
                                              labels = levs)),
                    watermoisture = factor(watermoisture),
                    alkali = factor(alkali),
                    humus = cut(humus, breaks = c(-0.5, 0.5, 1.5, 2.5, 3.5),
                                labels = 1:4),
                    type = factor(type),
                    fert = factor(fert))
forest <- droplevels(na.omit(forest))
forest$ph <- as.numeric(forest$ph)

ctrl <- gam.control(nthreads = 3)

Models:

forest.m1 <- gam(aggDefol ~ s(age) + ph + watermoisture,
                     data = forest, 
                     family = ocat(R = 3), 
                     method = "REML",
                     control = ctrl)

forest.m2 <- gam(canopyd ~ s(age) + ph + watermoisture + aggDefol,
                     data = forest, 
                     method = "REML",
                     control = ctrl)

How can I determine if e. g. the effect of s(age) and watermoisture is stronger on aggDefol than canopyd?

Best Answer

Comparing whether the effect of variable $x_1$ is stronger on a response variable $y_1$ than another $y_2$ is reasonably straightforward if certain conditions are met such that the question is well-defined.

Generally speaking, the coefficient $\beta_1$ gives us an idea about the estimated average change in response per unit of change of $x_1$. Now, assuming $y_1$ and $y_2$ are on the same scale, we are comparing changes to $y_1$ apples to $y_2$ apples, so we are good.

Thus the first issue here is that $y_1$ is an ordinal categorical variable and $y_2$ is a Gaussian; on face value, this is "not great" but not all is lost. Luckily, we have an identity link for our ordinal model m_1 meaning that our linear predictor is "just" a latent variable that is then segmented based on the estimated transition thresholds. That means that we can squint our eyes and say that the beta's from m_1 and m_2 refer to the same scale, just for m_1 the scale is the latent scale and in m_2 the response scale. So the first thing to do here is to rescale aggDefol and canopyd to have the same scale. This is is going to be a quirky thing to do as I strongly suspect that changing canopyd to be in $[1,3]$ is likely necessary because R encodes order categorical variables as integers. But theoretically can be done. Let me note if we simply use defol and canopyd directly as numeric values normalised to be $N(0,1)$, our task would be greatly simplified and probably more coherent. We are using splines for age and ph after all so if there are strong non-linearities due to them we should still be able to account for them.

The second issue is that aggDefol is a response variable in model m_1 but one of the explanatory variables in m2. That makes the effect of age and watermoisture in m_2 compounded by any information we have in m_1 by aggDefol. Simply put, we should't do it as it is complicated and rife with possible bias due to confounding and/or mediating effects. But let's say we throw caution to the wind. We could use the estimated response of defol from m_1 as a feature in m_2, not the raw values. As such, our comparison on "whether the effect of age on defol is greater than the one on canopyd" becomes the comparison of $\beta_{\text{age}}^{m_1}$ against the combined effect of age in canopyd which will be $\beta_{\text{age}}^{m_2} + \beta_{\text{age}}^{m_1}$. It is very messy, but assuming that canopyd and defol are on the same scale it "works". It will be also a mess to do when using a spline basis so I would strongly suggest using a polynomial basis so it is clear what comparisons are done.

To recap, I strong suggest reformulating the question. First of all, use similar response scales so it is obvious what they represent. Second, avoid using the same variable $x$ as a response variable and as an explanatory variable; if we cannot help it, we can use the predicted variable from the one model instead of the raw variables because in that way we effectively allow information from variable $x$ to come in the second model only through the other variables (as the response from m_1 will be simply the linear combination of the explanatory variables of it).

OK and some code:

forest$defol_norm = scale(forest$defol)
forest$canopyd_norm = scale(forest$canopyd)
forest.m1 <- gam(defol_norm ~ poly(age,2) + s(ph) + watermoisture,
                 data = forest, method = "REML")
forest$pred_defol = predict(forest.m1)
forest.m2 <- gam(canopyd_norm ~  poly(age,2)  + s(ph) + watermoisture + pred_defol,
                 data = forest, method = "REML")
# Get the effects expected by "watermoisture2", perform similar analysis for other vars
abs(forest.m1 |> coefficients()|> getElement("watermoisture2"))
abs(forest.m2 |> coefficients()|> getElement("watermoisture2") + 
    forest.m1 |> coefficients()|> getElement("watermoisture2"))

suggesting that both age and watermoisture have a much more pronounced effect on (normalised) canopyd rather than on (normalised) defol. Also, note that this methodology only focused on $\beta$'s. As all $p$-values suggested reasonable levels of statistical significance across all the explanatory variables examined, there is the working assumption that "all variables are equally important but have different effect sizes".

Related Solutions

Solved – GAMs with many slightly correlated predictors

You need to consider more than just linear correlations among covariates when working with GAMs. You need to consider nonlinear correlations or dependencies. This is known as concurvity in the smoothing/GAM context.

I would argue also that your model selection process is also wrong; I would fit the full model you consider and include some for of shrinkage/regularization to perform selection. I would also consider measures of concurvity for the input data when evaluating the model.

In the penalized GLM approach to GAMs, for example as implemented in the R package mgcv by Simon Wood, a smoothness penalty term that penalizes wiggliness of the fitted smooth as a component of the model penalized likelihood that is maximised during fitting. That smoothness penalty can force a wiggly term back to a linear function. However, the way this penalty works, it cannot act on the infinitely smooth parts of the spline basis expansion. In other words it can't shrink beyond a linear term.

One approach that solves this problem is that of Marra and Wood (2011) — they place a second penalty on the perfectly smooth parts of the basis (the penalty null space). With both penalties in place, the wiggly parts of the basis (the range space) and the perfectly smooth parts of the basis (the null space) can both be subject to shrinkage. The potential effect is that, if warranted a spline's contribution to the fitted model can be shrunk to become effectively, but not quite, zero.

Your approach is making an explicit statement that the effects of the other covariates are exactly equal to zero.

Using a more principled approach to feature selection in GAMs will help avoid you estimated smooths changing too much between candidate models.

You can read a little more about concurvity here in the help page for the R function mgcv::concurvity().

Marra, G. & Wood, S. N. Practical variable selection for generalized additive models. Comput. Stat. Data Anal. 55, 2372–2387 (2011).

GAMs Plotting – Plotting Generalized Additive Models on Response Scale with Multiple Smooth and Linear Terms

If the model contains z then the effect of x estimated by the model is that given z is in the model. Hence the fitted response is the additive sum of the two effects, and we can't talk generally about the estimated values of the response for a range of values of x without also stating the value of z.

For Gaussian models, you can just add on the intercept in plot.gam() to shift around the smooth curve on the y-axis. See argument shift to plot.gam(). This assumes that as per the example x and z are unrelated in the model, and furthermore some value of z (in this case I think 0 as it is a linear term not subject to identifiability constraints).

A more general solution is just to predict from the model at a grid of values over x while holding z constant at some representative value, say its mean or median.

Here's a full example of doing this by hand:

library("mgcv")
library("ggplot2")

set.seed(1)
df <- gamSim()
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df, method = "REML")

new_data <- with(df, expand.grid(x2 = seq(min(x2), max(x2), length = 200),
                                 x0 = median(x0),
                                 x1 = median(x1),
                                 x3 = median(x3)))

ilink <- family(m)$linkinv
pred <- predict(m, new_data, type = "link", se.fit = TRUE)
pred <- cbind(pred, new_data)
pred <- transform(pred, lwr_ci = ilink(fit - (2 * se.fit)),
                        upr_ci = ilink(fit + (2 * se.fit)),
                        fitted = ilink(fit))

ggplot(pred, aes(x = x2, y = fitted)) +
  geom_ribbon(aes(ymin = lwr_ci, ymax = upr_ci), alpha = 0.2) +
  geom_line()

producing

That script should be fine for any of the standard family options in mgcv, but you'll have to take careful note of what predict() returns for some of the fancier families in mgcv.

Best Answer

Related Solutions

Solved – GAMs with many slightly correlated predictors

GAMs Plotting – Plotting Generalized Additive Models on Response Scale with Multiple Smooth and Linear Terms

Related Question