I have two models, which share some predictors. I'd like to compare the magnitude of their effects on the respective response variable.
Here's an example based on data and code taken from here.
Data:
library(mgcv)
forest <- read.table(url("https://raw.githubusercontent.com/eric-pedersen/mgcv-esa-workshop/master/data/forest-health/beech.raw"),
header = TRUE)
forest <- transform(forest, id = factor(formatC(id, width = 2, flag = "0")))
## Aggregate defoliation & convert categorical vars to factors
levs <- c("low","med","high")
forest <- transform(forest,
aggDefol = as.numeric(cut(defol, breaks = c(-1,10,45,101),
labels = levs)),
watermoisture = factor(watermoisture),
alkali = factor(alkali),
humus = cut(humus, breaks = c(-0.5, 0.5, 1.5, 2.5, 3.5),
labels = 1:4),
type = factor(type),
fert = factor(fert))
forest <- droplevels(na.omit(forest))
forest$ph <- as.numeric(forest$ph)
ctrl <- gam.control(nthreads = 3)
Models:
forest.m1 <- gam(aggDefol ~ s(age) + ph + watermoisture,
data = forest,
family = ocat(R = 3),
method = "REML",
control = ctrl)
forest.m2 <- gam(canopyd ~ s(age) + ph + watermoisture + aggDefol,
data = forest,
method = "REML",
control = ctrl)
How can I determine if e. g. the effect of s(age)
and watermoisture
is stronger on aggDefol
than canopyd
?
Best Answer
Comparing whether the effect of variable $x_1$ is stronger on a response variable $y_1$ than another $y_2$ is reasonably straightforward if certain conditions are met such that the question is well-defined.
Generally speaking, the coefficient $\beta_1$ gives us an idea about the estimated average change in response per unit of change of $x_1$. Now, assuming $y_1$ and $y_2$ are on the same scale, we are comparing changes to $y_1$ apples to $y_2$ apples, so we are good.
Thus the first issue here is that $y_1$ is an ordinal categorical variable and $y_2$ is a Gaussian; on face value, this is "not great" but not all is lost. Luckily, we have an identity link for our ordinal model
m_1
meaning that our linear predictor is "just" a latent variable that is then segmented based on the estimated transition thresholds. That means that we can squint our eyes and say that the beta's fromm_1
andm_2
refer to the same scale, just form_1
the scale is the latent scale and inm_2
the response scale. So the first thing to do here is to rescaleaggDefol
andcanopyd
to have the same scale. This is is going to be a quirky thing to do as I strongly suspect that changingcanopyd
to be in $[1,3]$ is likely necessary because R encodes order categorical variables as integers. But theoretically can be done. Let me note if we simply usedefol
andcanopyd
directly as numeric values normalised to be $N(0,1)$, our task would be greatly simplified and probably more coherent. We are using splines forage
andph
after all so if there are strong non-linearities due to them we should still be able to account for them.The second issue is that
aggDefol
is a response variable in modelm_1
but one of the explanatory variables inm2
. That makes the effect ofage
andwatermoisture
inm_2
compounded by any information we have inm_1
byaggDefol
. Simply put, we should't do it as it is complicated and rife with possible bias due to confounding and/or mediating effects. But let's say we throw caution to the wind. We could use the estimated response ofdefol
fromm_1
as a feature inm_2
, not the raw values. As such, our comparison on "whether the effect ofage
ondefol
is greater than the one oncanopyd
" becomes the comparison of $\beta_{\text{age}}^{m_1}$ against the combined effect ofage
incanopyd
which will be $\beta_{\text{age}}^{m_2} + \beta_{\text{age}}^{m_1}$. It is very messy, but assuming thatcanopyd
anddefol
are on the same scale it "works". It will be also a mess to do when using a spline basis so I would strongly suggest using a polynomial basis so it is clear what comparisons are done.To recap, I strong suggest reformulating the question. First of all, use similar response scales so it is obvious what they represent. Second, avoid using the same variable $x$ as a response variable and as an explanatory variable; if we cannot help it, we can use the predicted variable from the one model instead of the raw variables because in that way we effectively allow information from variable $x$ to come in the second model only through the other variables (as the response from
m_1
will be simply the linear combination of the explanatory variables of it).OK and some code:
suggesting that both
age
andwatermoisture
have a much more pronounced effect on (normalised)canopyd
rather than on (normalised)defol
. Also, note that this methodology only focused on $\beta$'s. As all $p$-values suggested reasonable levels of statistical significance across all the explanatory variables examined, there is the working assumption that "all variables are equally important but have different effect sizes".