GAM Predictions – How to Compare Effects of Predictors Between GAMs

effect-sizegeneralized-additive-modelmgcv

I have two models, which share some predictors. I'd like to compare the magnitude of their effects on the respective response variable.

Here's an example based on data and code taken from here.

Data:

library(mgcv)

forest <- read.table(url("https://raw.githubusercontent.com/eric-pedersen/mgcv-esa-workshop/master/data/forest-health/beech.raw"),
                     header = TRUE)

forest <- transform(forest, id = factor(formatC(id, width = 2, flag = "0")))

## Aggregate defoliation & convert categorical vars to factors
levs <- c("low","med","high")
forest <- transform(forest,
                    aggDefol = as.numeric(cut(defol, breaks = c(-1,10,45,101),
                                              labels = levs)),
                    watermoisture = factor(watermoisture),
                    alkali = factor(alkali),
                    humus = cut(humus, breaks = c(-0.5, 0.5, 1.5, 2.5, 3.5),
                                labels = 1:4),
                    type = factor(type),
                    fert = factor(fert))
forest <- droplevels(na.omit(forest))
forest$ph <- as.numeric(forest$ph)

ctrl <- gam.control(nthreads = 3)

Models:

forest.m1 <- gam(aggDefol ~ s(age) + ph + watermoisture,
                     data = forest, 
                     family = ocat(R = 3), 
                     method = "REML",
                     control = ctrl)

forest.m2 <- gam(canopyd ~ s(age) + ph + watermoisture + aggDefol,
                     data = forest, 
                     method = "REML",
                     control = ctrl)

How can I determine if e. g. the effect of s(age) and watermoisture is stronger on aggDefol than canopyd?

Best Answer

Comparing whether the effect of variable $x_1$ is stronger on a response variable $y_1$ than another $y_2$ is reasonably straightforward if certain conditions are met such that the question is well-defined.

Generally speaking, the coefficient $\beta_1$ gives us an idea about the estimated average change in response per unit of change of $x_1$. Now, assuming $y_1$ and $y_2$ are on the same scale, we are comparing changes to $y_1$ apples to $y_2$ apples, so we are good.

Thus the first issue here is that $y_1$ is an ordinal categorical variable and $y_2$ is a Gaussian; on face value, this is "not great" but not all is lost. Luckily, we have an identity link for our ordinal model m_1 meaning that our linear predictor is "just" a latent variable that is then segmented based on the estimated transition thresholds. That means that we can squint our eyes and say that the beta's from m_1 and m_2 refer to the same scale, just for m_1 the scale is the latent scale and in m_2 the response scale. So the first thing to do here is to rescale aggDefol and canopyd to have the same scale. This is is going to be a quirky thing to do as I strongly suspect that changing canopyd to be in $[1,3]$ is likely necessary because R encodes order categorical variables as integers. But theoretically can be done. Let me note if we simply use defol and canopyd directly as numeric values normalised to be $N(0,1)$, our task would be greatly simplified and probably more coherent. We are using splines for age and ph after all so if there are strong non-linearities due to them we should still be able to account for them.

The second issue is that aggDefol is a response variable in model m_1 but one of the explanatory variables in m2. That makes the effect of age and watermoisture in m_2 compounded by any information we have in m_1 by aggDefol. Simply put, we should't do it as it is complicated and rife with possible bias due to confounding and/or mediating effects. But let's say we throw caution to the wind. We could use the estimated response of defol from m_1 as a feature in m_2, not the raw values. As such, our comparison on "whether the effect of age on defol is greater than the one on canopyd" becomes the comparison of $\beta_{\text{age}}^{m_1}$ against the combined effect of age in canopyd which will be $\beta_{\text{age}}^{m_2} + \beta_{\text{age}}^{m_1}$. It is very messy, but assuming that canopyd and defol are on the same scale it "works". It will be also a mess to do when using a spline basis so I would strongly suggest using a polynomial basis so it is clear what comparisons are done.

To recap, I strong suggest reformulating the question. First of all, use similar response scales so it is obvious what they represent. Second, avoid using the same variable $x$ as a response variable and as an explanatory variable; if we cannot help it, we can use the predicted variable from the one model instead of the raw variables because in that way we effectively allow information from variable $x$ to come in the second model only through the other variables (as the response from m_1 will be simply the linear combination of the explanatory variables of it).

OK and some code:

forest$defol_norm = scale(forest$defol)
forest$canopyd_norm = scale(forest$canopyd)
forest.m1 <- gam(defol_norm ~ poly(age,2) + s(ph) + watermoisture,
                 data = forest, method = "REML")
forest$pred_defol = predict(forest.m1)
forest.m2 <- gam(canopyd_norm ~  poly(age,2)  + s(ph) + watermoisture + pred_defol,
                 data = forest, method = "REML")
# Get the effects expected by "watermoisture2", perform similar analysis for other vars
abs(forest.m1 |> coefficients()|> getElement("watermoisture2"))
abs(forest.m2 |> coefficients()|> getElement("watermoisture2") + 
    forest.m1 |> coefficients()|> getElement("watermoisture2"))

suggesting that both age and watermoisture have a much more pronounced effect on (normalised) canopyd rather than on (normalised) defol. Also, note that this methodology only focused on $\beta$'s. As all $p$-values suggested reasonable levels of statistical significance across all the explanatory variables examined, there is the working assumption that "all variables are equally important but have different effect sizes".