Generalized Additive Models – How to Use s() or te() with Interactions in GAMs using MGCV


I am trying to model CO2 fluxes (fco2) using a number of environmental parameters using a GAM in mgcv.

Specifically, I have leaf temperature (tl), vapour pressure deficit (vpd), and soil water content (swc). vpd is a function of tl, air pressure and relative humidity (the latter two were not measured). I get the best model response with a 3-way interaction between them, but also a relatively good one with an interaction between tl and vpd only. Now I'm wondering about the following:

I get a lower AIC using s() (m1 below) than using te() (m2 below). Which one is correct, and why do they differ?

m1 <- gam(fco2 ~ s(tl) + s(vpd) + s(swc) + ti(tl, vpd), data=df, method='REML')
m2 <- gam(fco2 ~ te(tl) + te(vpd) + te(swc) + ti(tl, vpd), data=df, method='REML')

Since vpd is a function of tl (among other things), should one of the two variables be removed? This significantly increases AIC though and lowers R².

Thanks a lot for the help

Best Answer

The difference isn't really because you use s() or te(), but because those two functions generate smooths with different default bases. s() generates a low-rank thin plate spline basis, whereas te() uses a cubic regression spline basis for the marginal smooths.

The correct way to express this model is using the s(x, bs = "cr") + s(z, bs = "cr") + ti(x,z) form. Although Simon allows the ti() form (ti(x) + ti(z) + ti(x,z)), I find it a bit odd to create a tensor product interaction of a single variable. I know it works and gives effectively the same model as the s() form, but I find the ti() version weird. The main plus in favour of the ti() form is that you are sure to get the right bases; with s() you need to specify the cr basis, for example, if you want to match the ti(), so it is something else that you need to think about and get right.

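As a minimal sketch of the forms discussed above, here they are fitted to simulated data. The data frame `sim` and the function used to generate `fco2` are invented for illustration; only the variable names follow the question. One assumption worth stating: `ti()` and `te()` default to `k = 5` per margin whereas `s()` defaults to `k = 10`, so `k` is set explicitly in the `cr` version so that it lines up with the all-`ti()` version.

```r
library(mgcv)

set.seed(1)
n <- 400
sim <- data.frame(tl = runif(n, 10, 35), swc = runif(n, 0.1, 0.4))
sim$vpd  <- 0.08 * sim$tl + runif(n, 0, 1.5)   # vpd partly driven by tl (made up)
sim$fco2 <- sin(sim$tl / 6) - 0.4 * sim$vpd + 0.02 * sim$tl * sim$vpd +
  5 * sim$swc + rnorm(n, sd = 0.3)

# s() with its default thin plate regression spline basis (bs = "tp")
m_tp <- gam(fco2 ~ s(tl) + s(vpd) + s(swc) + ti(tl, vpd),
            data = sim, method = "REML")

# s() with the cubic regression spline basis used by the ti() margins
m_cr <- gam(fco2 ~ s(tl, bs = "cr", k = 5) + s(vpd, bs = "cr", k = 5) +
              s(swc, bs = "cr", k = 5) + ti(tl, vpd),
            data = sim, method = "REML")

# the all-ti() form
m_ti <- gam(fco2 ~ ti(tl) + ti(vpd) + ti(swc) + ti(tl, vpd),
            data = sim, method = "REML")

AIC(m_tp, m_cr, m_ti)
```

If the bases really do match, `m_cr` and `m_ti` should give essentially the same fit; `range(fitted(m_cr) - fitted(m_ti))` is a quick way to check that on your own data.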
If vpd is a function of tl, then it is likely that your model suffers from a problem called concurvity, where a term in the model can be represented by a smooth combination of the other terms in the model. You can check this with the concurvity() function in {mgcv}. You don't have to remove tl; it sounds like it is not just a simple function of vpd, so you might be OK leaving it in and accepting any high concurvity (if present).
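A rough sketch of the concurvity check, again on simulated data. The relative humidity values, the Tetens-type formula used to derive `vpd` from `tl`, and the response function are all assumptions made purely to get a `vpd` that depends on `tl`; they are not taken from the question.

```r
library(mgcv)

set.seed(42)
n   <- 300
tl  <- runif(n, 10, 35)                          # leaf temperature, deg C
rh  <- runif(n, 0.3, 0.9)                        # relative humidity as a fraction (hypothetical)
es  <- 0.6108 * exp(17.27 * tl / (tl + 237.3))   # Tetens-type saturation vapour pressure, kPa
vpd <- es * (1 - rh)                             # vpd depends on tl and on rh
swc <- runif(n, 0.1, 0.4)
fco2 <- sin(tl / 6) - 0.5 * vpd + 5 * swc + rnorm(n, sd = 0.3)

m <- gam(fco2 ~ s(tl) + s(vpd) + s(swc), method = "REML")

conc_full <- concurvity(m, full = TRUE)    # each term against all the other terms
conc_pair <- concurvity(m, full = FALSE)   # pairwise, returned as a list of matrices

round(conc_full, 2)
round(conc_pair$estimate, 2)
```

Values near 1 for `s(tl)` and `s(vpd)` would indicate that one can largely be reproduced from the other terms; values near 0 indicate little concurvity.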

GAMs in R – Understanding Tensor Product Interactions in MGCV Package

1) Univariate smooth

Let's say we have some response data $y$ that we conjecture is an unknown function $f$ of a predictor variable $x$ plus some error $ε$. The model would be:

$$y=f(x)+ε$$

Now, in order to fit this model, we have to identify the functional form of $f$. The way we do this is by identifying basis functions, which are superposed in order to represent the function $f$ in its entirety. A very simple example is a linear regression, in which the basis functions are just $x$ and the constant $1$, with coefficients $β_2$ (the slope) and $β_1$ (the intercept). Applying the basis expansion, we have

$$y=β_1+β_2x+ε$$

In matrix form, we would have:

$$Y=Xβ+ε$$

Where $Y$ is an n-by-1 column vector, $X$ is an n-by-2 model matrix, $β$ is a 2-by-1 column vector of model coefficients, and $ε$ is an n-by-1 column vector of errors. $X$ has two columns because there are two terms in our basis expansion: the linear term and the intercept.

The same principle applies for basis expansion in MGCV, although the basis functions are much more sophisticated. Specifically, individual basis functions need not be defined over the full domain of the independent variable $x$. Such is often the case when using knot-based bases (see "knot based example"). The model is then represented as the sum of the basis functions, each of which is evaluated at every value of the independent variable. However, as I mentioned, some of these basis functions take on a value of zero outside of a given interval and thus do not contribute to the basis expansion outside of that interval. As an example, consider a cubic spline basis in which each basis function is symmetric about a different value (knot) of the independent variable -- in other words, every basis function looks the same but is just shifted along the axis of the independent variable (this is an oversimplification, as any practical basis will also include an intercept and a linear term, but hopefully you get the idea).

To be explicit, a basis expansion with $i$ terms (an intercept, a linear term and $i-2$ further basis functions) could look like:

$$y=β_1+β_2x+β_3f_1(x)+β_4f_2(x)+...+β_if_{i-2} (x)+ε$$

where each function $f$ is, perhaps, a cubic function of the independent variable $x$.

The matrix equation $Y=Xβ+ε$ can still be used to represent our model. The only difference is that $X$ is now an n-by-i matrix; that is, it has a column for every term in the basis expansion (including the intercept and linear term). Since the process of basis expansion has allowed us to represent the model in the form of a matrix equation, we can use linear least squares to fit the model and find the coefficients $β$.
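To make the matrix form concrete, here is a small base-R sketch of an unpenalized basis expansion fitted by least squares. The truncated cubic functions $f_k(x)=\max(x-\kappa_k,0)^3$, the knot locations and the simulated data are illustrative assumptions, not the basis mgcv actually uses.

```r
set.seed(1)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)

# illustrative basis functions: truncated cubics, one per knot
knots <- c(0.2, 0.4, 0.6, 0.8)
Fx <- sapply(knots, function(k) pmax(x - k, 0)^3)

# n-by-i model matrix: intercept, linear term, and i - 2 = 4 basis functions
X <- cbind(1, x, Fx)

# normal equations: beta = (X'X)^{-1} X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# the same fit through lm(), which uses a QR decomposition internally
fit_lm <- lm(y ~ X - 1)
max(abs(X %*% beta_hat - fitted(fit_lm)))
```

In practice a QR-based solver such as `lm()` is numerically safer than forming $X^TX$ explicitly; the normal equations are used here only because they mirror the formula in the text.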

This is an example of unpenalized regression, and one of the main strengths of MGCV is its smoothness estimation via a penalty matrix and smoothing parameter. In other words, instead of:

$$β=(X^TX)^{-1}X^TY$$

we have:

$$β=(X^TX+λS)^{-1}X^TY$$

where $S$ is a quadratic $i$-by-$i$ penalty matrix and $λ$ is a scalar smoothing parameter. I will not go into the specification of the penalty matrix here, but it should suffice to say that for any given basis expansion of some independent variable and definition of a quadratic "wiggliness" penalty (for example, a second-derivative penalty), one can calculate the penalty matrix $S$.

MGCV can use various means of estimating the optimal smoothing parameter $λ$. I will not go into that subject since my goal here was to give a broad overview of how a univariate smooth is constructed, which I believe I have done.
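The penalized solution can be sketched directly, assuming we let mgcv's `smoothCon()` build the model matrix $X$ and a penalty matrix $S$ for a cubic regression spline. Note that `smoothCon()` rescales the penalty by default, so the $λ$ values below are only meaningful relative to one another; the data and the three $λ$ values are arbitrary choices for illustration.

```r
library(mgcv)

set.seed(2)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# build a 10-dimensional cubic regression spline basis and its penalty
sm <- smoothCon(s(x, bs = "cr", k = 10), data = data.frame(x = x))[[1]]
X <- sm$X        # n-by-10 model matrix
S <- sm$S[[1]]   # 10-by-10 penalty matrix

# beta = (X'X + lambda * S)^{-1} X'y for a fixed smoothing parameter
pen_fit <- function(lambda) {
  beta <- solve(crossprod(X) + lambda * S, crossprod(X, y))
  list(beta = beta, wiggliness = drop(t(beta) %*% S %*% beta))
}

lambdas <- c(1e-4, 1, 1e4)
fits <- lapply(lambdas, pen_fit)

# the quadratic penalty beta' S beta shrinks as lambda grows
wig <- sapply(fits, `[[`, "wiggliness")
wig
```

`gam()` does the same kind of solve internally, but also chooses $λ$ for you (by REML, GCV, etc.) instead of taking it as given.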

2) Multivariate smooth

The above explanation can be generalized to multiple dimensions. Let's go back to our model that gives the response $y$ as a function $f$ of predictors $x$ and $z$. The restriction to two independent variables will prevent cluttering the explanation with arcane notation. The model is then:

$$y=f(x,z)+ε$$

Now, it should be intuitively obvious that we are going to represent $f(x,z)$ with a basis expansion (that is, a superposition of basis functions) just like we did in the univariate case of $f(x)$ above. It should also be obvious that at least one, and almost certainly many more, of these basis functions must be functions of both $x$ and $z$ (if this were not the case, then implicitly $f$ would be separable such that $f(x,z)=f_x(x)+f_z(z)$). A visual illustration of a multidimensional spline basis can be found here. A full two-dimensional basis expansion with $i$ terms (an intercept, two linear terms and $i-3$ bivariate basis functions) could look something like:

$$y=β_1+β_2x+β_3z+β_4f_1(x,z)+...+β_if_{i-3} (x,z)+ε$$

I think it's pretty clear that we can still represent this in matrix form with:

$$Y=Xβ+ε$$

by simply evaluating each basis function at every unique combination of $x$ and $z$. The solution is still:

$$β=(X^TX)^{-1}X^TY$$

Computing the second derivative penalty matrix is very much the same as in the univariate case, except that instead of integrating the second derivative of each basis function with respect to a single variable, we integrate the sum of all second derivatives (including partials) with respect to all independent variables. The details of the foregoing are not especially important: the point is that we can still construct penalty matrix $S$ and use the same method to get the optimal value of smoothing parameter $λ$, and given that smoothing parameter, the vector of coefficients is still:

$$β=(X^TX+λS)^{-1}X^TY$$

Now, this two-dimensional smooth has an isotropic penalty: this means that a single value of $λ$ applies in both directions. This works fine when both $x$ and $z$ are on approximately the same scale, such as a spatial application. But what if we replace spatial variable $z$ with temporal variable $t$? The units of $t$ may be much larger or smaller than the units of $x$, and this can throw off the integration of our second derivatives because some of those derivatives will contribute disproportionately to the overall integration (for example, if we measure $t$ in nanoseconds and $x$ in light years, the integral of the second derivative with respect to $t$ may be vastly larger than the integral of the second derivative with respect to $x$, and thus "wiggliness" along the $x$ direction may go largely unpenalized). Slide 15 of the "smooth toolbox" I linked has more detail on this topic.
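The sensitivity of an isotropic smooth to units can be seen in a small experiment. This is only a sketch on simulated data: the surface, the noise level and the factor of 1000 are arbitrary assumptions.

```r
library(mgcv)

set.seed(3)
n <- 400
x <- runif(n)
z <- runif(n)
y <- sin(2 * pi * x) * cos(2 * pi * z) + rnorm(n, sd = 0.2)

z_big <- z * 1000   # the same variable expressed in different units

# isotropic thin plate smooths of the same two covariates
m_iso     <- gam(y ~ s(x, z),     method = "REML")
m_iso_big <- gam(y ~ s(x, z_big), method = "REML")

# with a single isotropic penalty the fitted surface depends on the units of z
iso_diff <- max(abs(fitted(m_iso) - fitted(m_iso_big)))
iso_diff
```

The data are identical in both fits apart from the units of one covariate, yet the fitted values differ, which is the problem the anisotropic penalties discussed later are meant to avoid.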

It is worth noting that we did not decompose the basis functions into marginal bases of $x$ and $z$. The implication here is that multivariate smooths must be constructed from bases supporting multiple variables. Tensor product smooths support construction of multivariate bases from univariate marginal bases, as I explain below.

3) Tensor product smooths

Tensor product smooths address the issue of modeling responses to interactions of multiple inputs with different units. Let's suppose we have a response $y$ that is a function $f$ of spatial variable $x$ and temporal variable $t$. Our model is then:

$$y=f(x,t)+ε$$

What we'd like to do is construct a two-dimensional basis for the variables $x$ and $t$. This will be a lot easier if we can represent $f$ as:

$$f(x,t)=f_x(x)f_t(t)$$

In an algebraic / analytical sense, this is not necessarily possible. But remember, we are discretizing the domains of $x$ and $t$ (imagine a two-dimensional "lattice" defined by the locations of knots on the $x$ and $t$ axes) such that the "true" function $f$ is represented by the superposition of basis functions. Just as we assumed that a very complex univariate function may be approximated by a simple cubic function on a specific interval of its domain, we may assume that the non-separable function $f(x,t)$ may be approximated by the product of simpler functions $f_x(x)$ and $f_t(t)$ on an interval—provided that our choice of basis dimensions makes those intervals sufficiently small!

Our basis expansion, given an $i$-dimensional basis in $x$ and $j$-dimensional basis in $t$, would then look like:

\begin{align}
y = {} &β_{1} + β_{2}x + β_{3}f_{x1}(x) + β_{4}f_{x2}(x) + \ldots + β_{i}f_{x(i-2)}(x) \\
&+ β_{i+1}t + β_{i+2}tx + β_{i+3}tf_{x1}(x) + β_{i+4}tf_{x2}(x) + \ldots + β_{2i}tf_{x(i-2)}(x) \\
&+ β_{2i+1}f_{t1}(t) + β_{2i+2}f_{t1}(t)x + β_{2i+3}f_{t1}(t)f_{x1}(x) + β_{2i+4}f_{t1}(t)f_{x2}(x) + \ldots + β_{3i}f_{t1}(t)f_{x(i-2)}(x) \\
&+ \ldots \\
&+ β_{ij}f_{t(j-2)}(t)f_{x(i-2)}(x) + ε
\end{align}

Which may be interpreted as a tensor product. Imagine that we evaluated each basis function in $x$ and $t$, thereby constructing n-by-i and n-by-j model matrices $X$ and $T$, respectively. We could then compute the $n^2$-by-$ij$ tensor (Kronecker) product $X \otimes T$ of these two model matrices and keep only the $n$ rows that pair each observation with itself (a row-wise Kronecker product), such that each column represents a unique pairing of one $x$ basis function with one $t$ basis function. Recall that the marginal model matrices had $i$ and $j$ columns, respectively. These values correspond to their respective basis dimensions. Our new two-variable basis should then have dimension $ij$, and therefore the same number of columns in its model matrix.
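The construction of the tensor product model matrix can be sketched in base R. The polynomial marginal bases and the helper `row_kron()` are illustrative assumptions (any marginal bases would do); the columns are ordered with the $x$ index varying fastest, to match the coefficient ordering in the expansion above.

```r
set.seed(4)
n  <- 6
x  <- runif(n)
tt <- runif(n)                   # named 'tt' to avoid masking base::t()

Xx <- cbind(1, x, x^2)           # n-by-i marginal model matrix, i = 3
Xt <- cbind(1, tt, tt^2, tt^3)   # n-by-j marginal model matrix, j = 4

# row-wise Kronecker product (hypothetical helper):
# each block of columns is one column of A multiplied into all columns of B
row_kron <- function(A, B) {
  do.call(cbind, lapply(seq_len(ncol(A)), function(a) A[, a] * B))
}
X_tensor <- row_kron(Xt, Xx)     # n-by-(i * j) tensor product model matrix

# the same thing obtained from the full n^2-by-ij Kronecker product,
# keeping only the rows that pair observation r with itself
full     <- kronecker(Xt, Xx)
same_obs <- (seq_len(n) - 1) * n + seq_len(n)
all.equal(X_tensor, full[same_obs, ], check.attributes = FALSE)
```

mgcv exposes this operation as `tensor.prod.model.matrix()`, which takes a list of marginal model matrices, so there is no need to hand-roll it outside of a demonstration like this.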

NOTE: I'd like to point out that since we explicitly constructed the tensor product basis functions by taking products of marginal basis functions, tensor product bases may be constructed from marginal bases of any type. They need not support more than one variable, unlike the multivariate smooth discussed above.

In reality, this process results in an overall basis expansion of dimension $ij-i-j+1$ because the full multiplication includes multiplying every $t$ basis function by the x-intercept $β_{x1}$ (so we subtract $j$) as well as multiplying every $x$ basis function by the t-intercept $β_{t1}$ (so we subtract $i$), but we must add the intercept back in by itself (so we add 1). This is known as applying an identifiability constraint.
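The coefficient counts are easy to check empirically. As far as I understand mgcv's behaviour, the count $ij-i-j+1=(i-1)(j-1)$ is what a `ti()` interaction term ends up with, because the marginal main effects are excluded, whereas a full `te()` smooth only loses one coefficient to its sum-to-zero constraint. The simulated data and the choice $i=5$, $j=6$ below are arbitrary.

```r
library(mgcv)

set.seed(5)
n <- 300
dat <- data.frame(x = runif(n), tt = runif(n))
dat$y <- sin(2 * pi * dat$x) * dat$tt + rnorm(n, sd = 0.2)

i <- 5   # marginal basis dimension for x
j <- 6   # marginal basis dimension for tt

m_ti <- gam(y ~ ti(x, tt, k = c(i, j)), data = dat, method = "REML")
m_te <- gam(y ~ te(x, tt, k = c(i, j)), data = dat, method = "REML")

# number of smooth coefficients, excluding the model intercept
n_ti <- length(coef(m_ti)) - 1   # interaction-only basis
n_te <- length(coef(m_te)) - 1   # full tensor product basis
c(ti = n_ti, te = n_te)
```

So the $ij-i-j+1$ expansion written out in this section corresponds to the pure interaction part; the main effects of $x$ and $t$ would then be supplied by separate terms.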

So we can represent this as:

$$y=β_1+β_2x+β_3t+β_4f_1(x,t)+β_5f_2(x,t)+...+β_{ij-i-j+1}f_{ij-i-j-2}(x,t)+ε$$

Where each of the multivariate basis functions $f$ is the product of a pair of marginal $x$ and $t$ basis functions. Again, it's pretty clear having constructed this basis that we can still represent this with the matrix equation:

$$Y=Xβ+ε$$

Which (still) has the solution:

$$β=(X^TX)^{-1}X^TY$$

Where the model matrix $X$ has $ij-i-j+1$ columns. As for the penalties $J_x$ and $J_t$, these are constructed separately for each independent variable as follows:

$$J_x=β^T (I_j \otimes S_x) β$$

and,

$$J_t=β^T (S_t \otimes I_i) β$$

This allows for an overall anisotropic (different in each direction) penalty (Note: the penalties on the second derivative of $x$ are added up at each knot on the $t$ axis, and vice versa). The smoothing parameters $λ_x$ and $λ_t$ may now be estimated in much the same way as the single smoothing parameter was for the univariate and multivariate smooths. The result is that the overall shape of a tensor product smooth is invariant to rescaling of its independent variables.
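A short sketch of the scale invariance, assuming the same kind of simulated surface as before (the data and the rescaling factor are arbitrary). A `te()` smooth gets one smoothing parameter per margin, and rescaling one covariate should leave the fitted surface essentially unchanged.

```r
library(mgcv)

set.seed(6)
n  <- 400
x  <- runif(n)
tt <- runif(n)
y  <- sin(2 * pi * x) * cos(2 * pi * tt) + rnorm(n, sd = 0.2)

tt_big <- tt * 1000   # the same variable expressed in different units

m_te     <- gam(y ~ te(x, tt),     method = "REML")
m_te_big <- gam(y ~ te(x, tt_big), method = "REML")

m_te$sp   # two smoothing parameters: one for x, one for tt

# difference between the two fitted surfaces; expected to be negligible
te_diff <- max(abs(fitted(m_te) - fitted(m_te_big)))
te_diff
```

Running the same comparison with `s(x, tt)` and `s(x, tt_big)` is a useful contrast, since an isotropic smooth does not share this invariance.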

I recommend reading all the vignettes on the MGCV website, as well as "Generalized Additive Models: An Introduction with R." Long live Simon Wood.

Generalized Additive Model – Variable and Model Selection Techniques

When fitting multiple models you can use AIC, but you have to appreciate what metric the two models are being assessed on. With AIC, which is an approximation to the leave-one-out cross validation error, the metric by which you are comparing models is one of their predictive ability. Using this as your metric, you'd be adding/removing variables with a confidence level of ~0.16 (instead of the "usual" 0.05, not that this usual value is a good option).

Once you've chosen your model with AIC, you then have the problem that any inference you might do is messed up by all the model selection you did. This is the point @Frank Harrell makes in his comment. If you only care about prediction then you might not be doing any post-selection inference (looking at p values, plotting smooths with the credible intervals, etc), but if you are going to do that inference, selection via AIC (or another type of stepwise selection) is going to mess things up until we have a good theory for post-selection inference (which is an active area of research, though I've not seen anything for GAMs as yet).

select=TRUE can only be used on a single (final) model AFTER using AIC, as a form of variable selection. Am I mistaken here?

Yes; you are mistaken. One would use select = TRUE instead of doing model selection by AIC, largely because the consequences of doing the selection via extra shrinkage penalties can be accounted for in the model summary() output/tests, whereas they can't (yet) be accounted for if you have done AIC-based selection.
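As a minimal sketch of what `select = TRUE` does, here is a model with two real effects and one pure-noise covariate. The data, the covariate names and the response function are invented for illustration.

```r
library(mgcv)

set.seed(7)
n <- 400
dat <- data.frame(x1 = runif(n), x2 = runif(n), noise = runif(n))
dat$y <- sin(2 * pi * dat$x1) + 2 * dat$x2^2 + rnorm(n, sd = 0.3)

# select = TRUE adds an extra penalty to each smooth so that a term
# can be shrunk towards zero effect as part of fitting
m_sel <- gam(y ~ s(x1) + s(x2) + s(noise), data = dat,
             method = "REML", select = TRUE)

# effective degrees of freedom per smooth term
edf <- summary(m_sel)$s.table[, "edf"]
round(edf, 2)

summary(m_sel)
```

A term whose effective degrees of freedom are shrunk to near zero has effectively been selected out of the model, and the `summary()` tests are computed on the shrunken fit rather than on a model chosen in a separate step.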

From the example code you show, it seems there is interest in testing if there are combinations of one or more smooth-by-smooth or smooth-by-factor interactions. Unless I was solely interested in creating a prediction tool, I would not be comparing all those models via AIC.

Furthermore, it seems like the models are exploring different hypotheses about the temporal aspects of the data set: does the effect of x vary smoothly over the years, or does the effect of time (year) differ between levels of factor_B?

None of these seem like something I'd want to compare using AIC (unless I was purely trying to find a model for prediction).

Personally, if I thought there was an interaction between the two factor variables, I'd fit that interaction and then consider the estimates of the effect size of that interaction. I wouldn't decide to exclude the interaction on the basis of a p-value (and AIC selection is using p values, just with a confidence level of ~0.16 for terms differing by 1 DF), not least because that is a very strong statement that the interaction effect is 0, which is very unlikely.

The choice of smooth terms doesn't make sense to me; perhaps it does in the context of the scientific problem you are working on? The s(z) term only comes up in one of the models for example, why is that? It might just be that your instructor is getting you to do something that isn't "best practice" but to show you what can happen when you use those "not best-practice" techniques. I.e. to make a point pedagogically. Without more context on why those candidate models were chosen, it's hard to comment further.

As to your discoveries:

The Workshop presentation you link to in the comments focuses too much (IMHO) on AIC as the selection metric.
select = TRUE can be used with AIC, but using select = TRUE kind of renders the whole model-selection-by-AIC thing redundant as the extra penalties on the smooths are doing selection for you. That doesn't mean AIC computed on the model is wrong; you could still compare it with a model that didn't have the extra shrinkage applied for example.
select = TRUE isn't only applicable when you have one model in mind. You might have two or a few well-selected models in mind and you can apply the extra shrinkage to all those models. But yes, it is typical that you have a full model (something that contains a well-thought-out selection of covariates/terms, not simply every variable you have in your data set.)
The AIC that is used in mgcv is a special AIC that is suited to the situation where we have smoothness parameter selection going on. This AIC uses a special penalty term replacing the $2k$ penalty from common-or-garden AIC. I am not aware of similar results for BIC, so I would use BIC with care with GAMs.
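The last point can be illustrated by pulling apart what `AIC()` computes for a `gam` object. This is a sketch on simulated data, and it assumes (as I understand it) that `AIC()` simply uses the log-likelihood and the `df` attribute returned by mgcv's `logLik()` method, where `df` is an effective degrees of freedom rather than a raw parameter count.

```r
library(mgcv)

set.seed(8)
n <- 300
dat <- data.frame(x = runif(n), z = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

m_x  <- gam(y ~ s(x),        data = dat, method = "REML")
m_xz <- gam(y ~ s(x) + s(z), data = dat, method = "REML")

ll <- logLik(m_x)
df_eff <- attr(ll, "df")   # effective degrees of freedom, generally non-integer
df_eff

# AIC rebuilt by hand from the log-likelihood and the effective df
aic_by_hand <- -2 * as.numeric(ll) + 2 * df_eff
c(by_hand = aic_by_hand, AIC = AIC(m_x))

AIC(m_x, m_xz)
```

The `df` column reported by `AIC(m_x, m_xz)` is therefore not the number of coefficients in each model, which is worth keeping in mind when comparing these values with AICs from unpenalized models.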
