I'll (try to) answer this in three steps: first, let's identify exactly what we mean by a univariate smooth. Next, we will describe a multivariate smooth (specifically, a smooth of two variables). Finally, I'll make my best attempt at describing a tensor product smooth.
1) Univariate smooth
Let's say we have some response data $y$ that we conjecture is an unknown function $f$ of a predictor variable $x$ plus some error $ε$. The model would be:
$$y=f(x)+ε$$
Now, in order to fit this model, we have to identify the functional form of $f$. The way we do this is by identifying basis functions, which are superposed in order to represent the function $f$ in its entirety. A very simple example is linear regression, in which the basis functions are just the constant $1$ and the identity $x$, with coefficients $β_1$ (the intercept) and $β_2$. Applying the basis expansion, we have
$$y=β_1+β_2x+ε$$
In matrix form, we would have:
$$Y=Xβ+ε$$
Where $Y$ is an n-by-1 column vector, $X$ is an n-by-2 model matrix, $β$ is a 2-by-1 column vector of model coefficients, and $ε$ is an n-by-1 column vector of errors. $X$ has two columns because there are two terms in our basis expansion: the linear term and the intercept.
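As a numeric sketch of this matrix form (numpy rather than R, with made-up data), the two-column model matrix and its least-squares fit look like:

```python
import numpy as np

# Hypothetical data: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2 + 3 * x + rng.normal(scale=0.1, size=x.size)

# n-by-2 model matrix: one column per basis function (intercept, linear term)
X = np.column_stack([np.ones_like(x), x])

# Linear least squares recovers the coefficient vector beta
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [2, 3]
```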
The same principle applies for basis expansion in MGCV, although the basis functions are much more sophisticated. Specifically, individual basis functions need not be defined over the full domain of the independent variable $x$. Such is often the case when using knot-based bases (see "knot based example"). The model is then represented as the sum of the basis functions, each of which is evaluated at every value of the independent variable. However, as I mentioned, some of these basis functions take on a value of zero outside of a given interval and thus do not contribute to the basis expansion outside of that interval. As an example, consider a cubic spline basis in which each basis function is symmetric about a different value (knot) of the independent variable -- in other words, every basis function looks the same but is just shifted along the axis of the independent variable (this is an oversimplification, as any practical basis will also include an intercept and a linear term, but hopefully you get the idea).
To be explicit, a basis expansion with $i$ terms (an intercept, a linear term, and $i-2$ spline basis functions) could look like:
$$y=β_1+β_2x+β_3f_1(x)+β_4f_2(x)+...+β_if_{i-2} (x)+ε$$
where each function $f$ is, perhaps, a cubic function of the independent variable $x$.
The matrix equation $Y=Xβ+ε$ can still be used to represent our model. The only difference is that $X$ is now an n-by-i matrix; that is, it has a column for every term in the basis expansion (including the intercept and linear term). Since the process of basis expansion has allowed us to represent the model in the form of a matrix equation, we can use linear least squares to fit the model and find the coefficients $β$.
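To make the basis-expansion idea concrete, here is a sketch using a simple knot-based basis (truncated power cubics; this is not the basis mgcv would actually use, just an easy one to write down):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

# Illustrative knot-based basis: intercept, linear term, and a shifted cubic
# piece per knot, each of which is zero to the left of its knot
knots = np.linspace(0.1, 0.9, 8)
X = np.column_stack(
    [np.ones_like(x), x]
    + [np.clip(x - k, 0, None) ** 3 for k in knots]
)

# Unpenalized linear least squares on the n-by-i model matrix
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
print(X.shape)  # (200, 10): n-by-i with i = 2 + number of knots
```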
This is an example of unpenalized regression, and one of the main strengths of MGCV is its smoothness estimation via a penalty matrix and smoothing parameter. In other words, instead of:
$$β=(X^TX)^{-1}X^TY$$
we have:
$$β=(X^TX+λS)^{-1}X^TY$$
where $S$ is an $i$-by-$i$ penalty matrix and $λ$ is a scalar smoothing parameter. I will not go into the specification of the penalty matrix here; it should suffice to say that for any given basis expansion of some independent variable and any quadratic "wiggliness" penalty (for example, a second-derivative penalty), one can calculate the penalty matrix $S$.
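A minimal sketch of the penalized solve, assuming a stand-in second-difference penalty on the coefficients rather than mgcv's integrated second-derivative penalty:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

# Illustrative knot-based basis: intercept, linear term, shifted cubic pieces
knots = np.linspace(0.1, 0.9, 8)
X = np.column_stack(
    [np.ones_like(x), x] + [np.clip(x - k, 0, None) ** 3 for k in knots]
)

# Stand-in penalty matrix S: squared second differences of the coefficients.
# (mgcv instead integrates squared second derivatives of the fitted function.)
i = X.shape[1]
D = np.diff(np.eye(i), n=2, axis=0)  # (i-2)-by-i second-difference operator
S = D.T @ D

lam = 1.0  # smoothing parameter; mgcv would estimate this (e.g. by GCV or REML)
beta_pen = np.linalg.solve(X.T @ X + lam * S, X.T @ y)
```

As $λ \to 0$ this reverts to the unpenalized least-squares fit; larger $λ$ shrinks the wiggly components of the fit.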
MGCV can use various means of estimating the optimal smoothing parameter $λ$. I will not go into that subject since my goal here was to give a broad overview of how a univariate smooth is constructed, which I believe I have done.
2) Multivariate smooth
The above explanation can be generalized to multiple dimensions. Let's go back to our model that gives the response $y$ as a function $f$ of predictors $x$ and $z$. The restriction to two independent variables will prevent cluttering the explanation with arcane notation. The model is then:
$$y=f(x,z)+ε$$
Now, it should be intuitively obvious that we are going to represent $f(x,z)$ with a basis expansion (that is, a superposition of basis functions) just as we did in the univariate case of $f(x)$ above. It should also be obvious that at least one, and almost certainly many more, of these basis functions must be functions of both $x$ and $z$ (if this were not the case, then $f$ would implicitly be separable, such that $f(x,z)=f_x(x)+f_z(z)$). A full two-dimensional basis expansion with $i$ terms (an intercept, two linear terms, and $i-3$ bivariate basis functions) could look something like:
$$y=β_1+β_2x+β_3z+β_4f_1(x,z)+...+β_if_{i-3} (x,z)+ε$$
I think it's pretty clear that we can still represent this in matrix form with:
$$Y=Xβ+ε$$
by simply evaluating each basis function at every observed pair of $x$ and $z$ values. The (unpenalized) solution is still:
$$β=(X^TX)^{-1}X^TY$$
Computing the second derivative penalty matrix is very much the same as in the univariate case, except that instead of integrating the second derivative of each basis function with respect to a single variable, we integrate the sum of all second derivatives (including partials) with respect to all independent variables. The details of the foregoing are not especially important: the point is that we can still construct penalty matrix $S$ and use the same method to get the optimal value of smoothing parameter $λ$, and given that smoothing parameter, the vector of coefficients is still:
$$β=(X^TX+λS)^{-1}X^TY$$
Now, this two-dimensional smooth has an isotropic penalty: this means that a single value of $λ$ applies in both directions. This works fine when both $x$ and $z$ are on approximately the same scale, such as a spatial application. But what if we replace spatial variable $z$ with temporal variable $t$? The units of $t$ may be much larger or smaller than the units of $x$, and this can throw off the integration of our second derivatives because some of those derivatives will contribute disproportionately to the overall integration (for example, if we measure $t$ in nanoseconds and $x$ in light years, the integral of the second derivative with respect to $t$ may be vastly larger than the integral of the second derivative with respect to $x$, and thus "wiggliness" along the $x$ direction may go largely unpenalized). Slide 15 of the "smooth toolbox" I linked has more detail on this topic.
It is worth noting that we did not decompose the basis functions into marginal bases of $x$ and $z$. The implication here is that multivariate smooths must be constructed from bases supporting multiple variables. Tensor product smooths support construction of multivariate bases from univariate marginal bases, as I explain below.
3) Tensor product smooths
Tensor product smooths address the issue of modeling responses to interactions of multiple inputs with different units. Let's suppose we have a response $y$ that is a function $f$ of spatial variable $x$ and temporal variable $t$. Our model is then:
$$y=f(x,t)+ε$$
What we'd like to do is construct a two-dimensional basis for the variables $x$ and $t$. This will be a lot easier if we can represent $f$ as:
$$f(x,t)=f_x(x)f_t(t)$$
In an algebraic / analytical sense, this is not necessarily possible. But remember, we are discretizing the domains of $x$ and $t$ (imagine a two-dimensional "lattice" defined by the locations of knots on the $x$ and $t$ axes) such that the "true" function $f$ is represented by the superposition of basis functions. Just as we assumed that a very complex univariate function may be approximated by a simple cubic function on a specific interval of its domain, we may assume that the non-separable function $f(x,t)$ may be approximated by the product of simpler functions $f_x(x)$ and $f_t(t)$ on an interval—provided that our choice of basis dimensions makes those intervals sufficiently small!
Our basis expansion, given an $i$-dimensional basis in $x$ and $j$-dimensional basis in $t$, would then look like:
\begin{align}
y = {}&β_{1} + β_{2}x + β_{3}f_{x1}(x)+β_{4}f_{x2}(x)+\ldots+β_{i}f_{x(i-2)}(x)+{} \\
&β_{i+1}t + β_{i+2}tx + β_{i+3}tf_{x1}(x)+β_{i+4}tf_{x2}(x)+\ldots+β_{2i}tf_{x(i-2)}(x)+{} \\
&β_{2i+1}f_{t1}(t) + β_{2i+2}f_{t1}(t)x + β_{2i+3}f_{t1}(t)f_{x1}(x)+β_{2i+4}f_{t1}(t)f_{x2}(x)+\ldots+{} \\
&β_{3i}f_{t1}(t)f_{x(i-2)}(x)+\ldots+ \\
&β_{ij}f_{t(j-2)}(t)f_{x(i-2)}(x) + ε
\end{align}
Which may be interpreted as a tensor product. Imagine that we evaluated each basis function in $x$ and $t$, thereby constructing n-by-i and n-by-j marginal model matrices $X$ and $T$, respectively. Taking the tensor (Kronecker) product of each row of $X$ with the corresponding row of $T$ yields an n-by-$ij$ model matrix in which each column corresponds to a unique pairing of one $x$ basis function with one $t$ basis function. (The full Kronecker product $X \otimes T$ would be $n^2$-by-$ij$; we keep only the rows where an observation is paired with itself.) Recall that the marginal model matrices had $i$ and $j$ columns, respectively, corresponding to their basis dimensions. Our new two-variable basis should then have dimension $ij$, and therefore the same number of columns in its model matrix.
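A small numpy illustration of this construction, with hypothetical random marginal model matrices standing in for evaluated bases:

```python
import numpy as np

rng = np.random.default_rng(3)
n, i, j = 6, 4, 3

# Hypothetical marginal model matrices: each basis function evaluated at the data
X = rng.normal(size=(n, i))  # x-basis, n-by-i
T = rng.normal(size=(n, j))  # t-basis, n-by-j

# Row-wise tensor product: row k of the result is the Kronecker product of
# row k of X with row k of T, one column per (x, t) basis-function pair
XT = np.einsum("ni,nj->nij", X, T).reshape(n, i * j)

print(XT.shape)  # (6, 12): n-by-ij
```

Each column of `XT` is the elementwise product of one column of `X` with one column of `T`, which is exactly the "product of marginal basis functions" described above.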
NOTE: I'd like to point out that since we explicitly constructed the tensor product basis functions by taking products of marginal basis functions, tensor product bases may be constructed from marginal bases of any type. They need not support more than one variable, unlike the multivariate smooth discussed above.
In reality, this process results in an overall basis expansion of dimension $ij-i-j+1 = (i-1)(j-1)$, because the full multiplication includes multiplying every $t$ basis function by the $x$-intercept (duplicating the marginal $t$ terms, so we subtract $j$) as well as multiplying every $x$ basis function by the $t$-intercept (so we subtract $i$), and we must then add the single shared intercept back in (so we add 1). Removing these duplicated columns amounts to applying an identifiability constraint.
So we can represent this as:
$$y=β_1+β_2x+β_3t+β_4f_1(x,t)+β_5f_2(x,t)+...+β_{ij-i-j+1}f_{ij-i-j-2}(x,t)+ε$$
Where each of the multivariate basis functions $f$ is the product of a pair of marginal $x$ and $t$ basis functions. Again, it's pretty clear having constructed this basis that we can still represent this with the matrix equation:
$$Y=Xβ+ε$$
Which (still) has the solution:
$$β=(X^TX)^{-1}X^TY$$
Where the model matrix $X$ has $ij-i-j+1$ columns. As for the wiggliness penalties $J_x$ and $J_t$, these are constructed separately for each independent variable as follows:
$$J_x=β^T (I_j \otimes S_x) β$$
and
$$J_t=β^T (S_t \otimes I_i) β$$
This allows for an overall anisotropic (different in each direction) penalty (Note: the penalties on the second derivative of $x$ are added up at each knot on the $t$ axis, and vice versa). The smoothing parameters $λ_x$ and $λ_t$ may now be estimated in much the same way as the single smoothing parameter was for the univariate and multivariate smooths. The result is that the overall shape of a tensor product smooth is invariant to rescaling of its independent variables.
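Assembling the overall anisotropic penalty from the two marginal penalties can be sketched as follows (using second-difference stand-ins for $S_x$ and $S_t$, which mgcv would instead derive from the marginal bases' derivative penalties):

```python
import numpy as np

i, j = 5, 4  # marginal basis dimensions for x and t

# Stand-in marginal penalty matrices (second-difference penalties)
Dx = np.diff(np.eye(i), n=2, axis=0)
Dt = np.diff(np.eye(j), n=2, axis=0)
S_x = Dx.T @ Dx  # i-by-i
S_t = Dt.T @ Dt  # j-by-j

lam_x, lam_t = 0.5, 20.0  # one smoothing parameter per direction

# Overall anisotropic penalty on the ij tensor-product coefficients:
# lam_x * (I_j kron S_x) penalizes wiggliness in x at every t basis function,
# lam_t * (S_t kron I_i) does the reverse
S = lam_x * np.kron(np.eye(j), S_x) + lam_t * np.kron(S_t, np.eye(i))

print(S.shape)  # (20, 20): ij-by-ij
```

Because $λ_x$ and $λ_t$ are separate, rescaling $x$ or $t$ only rescales the corresponding smoothing parameter, which is why the fitted smooth is invariant to such rescaling.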
I recommend reading all the vignettes on the mgcv website, as well as Wood's "Generalized Additive Models: An Introduction with R." Long live Simon Wood.
Best Answer
Q1
There are a couple of key practical situations where the decomposed model, using separate marginal smooths plus a special tensor-product interaction (`ti()`) smooth, is a useful approach to take.

The first situation is one where testing for the presence of an interaction is important to the problem being addressed with the GAM. If we fit the bivariate smooth model
$$y_i \sim \mathcal{D}(\mu_i, \theta)$$ $$g(\mu_i) = \beta_0 + f(x_i, z_i)$$
representing $f(x_i, z_i)$ via a tensor product smooth using `te()` in mgcv, then, despite estimating the bivariate smooth interaction model we wanted, we have no model term that explicitly represents the interaction that we want to test. One might assume that we can use a generalized likelihood ratio test (GLRT) to compare the model above with the simpler model
$$g(\mu_i) = \beta_0 + f_1(x_i) + f_2(z_i)$$
but one has to take great care to ensure that these two models are strictly nested for that GLRT approach to be valid.
To get the correct nesting we can decompose the full bivariate smooth tensor product smooth into separate marginal smooths plus a tensor product smooth whose basis expansion has had the main effects of the separate marginal smooths removed from it.
Our model then becomes
$$g(\mu_i) = \beta_0 + f_1(x_i) + f_2(z_i) + f_3^{*}(x_i, z_i)$$
where the superscript $*$ is just there to remind us that $f_3^{*}(x_i, z_i)$ uses a different basis expansion to that of $f(x_i, z_i)$ from the original model.
Now you could compare this decomposed model with the simpler model with two univariate smooths and the proper nesting required for the GLRT, or one could use the Wald-like tests of the null hypothesis that $f_3() = 0$.
A practical example of when this might be needed in my own work is modelling monthly or daily air temperature time series. These data have a seasonal component plus a long-term trend, and it is of scientific interest to ask whether the air temperatures are changing over time, either as a trend common to all months or in a way that itself varies with the season.
The first option would amount to fitting the simpler univariate smooth model and focusing inference on the smooth of year:
$$g(\mu_i) = \beta_0 + f_1(\text{year}_i) + f_2(\text{month}_i),$$
whilst the second option would imply that $f_3^{*}(\text{year}_i,\text{month}_i)$ in
$$g(\mu_i) = \beta_0 + f_1(\text{year}_i) + f_2(\text{month}_i) + f_3^{*}(\text{year}_i,\text{month}_i)$$
is non-zero.
Another situation where the tensor product interaction smooth is useful is when you wish to include several smooth effects in your model, and only some of them are required to interact or be part of bivariate relationships. Say we have four covariates, indicated by $x_i$, $z_i$, $v_i$ and $w_i$. Further assume we want to fit a model where $\mathbf{x}$ interacts with $\mathbf{z}$ and $\mathbf{v}$, whilst $\mathbf{z}$ interacts with $\mathbf{v}$ and $\mathbf{w}$. We could fit the following model
$$g(\mu_i) = \beta_0 + f_1(x_i, z_i) + f_2(x_i, v_i) + f_3(z_i, v_i) + f_4(z_i, w_i)$$
but we should note that the main effects of all of the covariates except that of $\mathbf{w}$ occur in the bases of two or more smooth functions. Despite the identifiability constraints that mgcv imposes, it is likely that problems will occur when fitting the model with these overlapping terms.
We can largely avoid these issues through the use of the tensor product interaction smooths and the decomposed parameterisation of the model:
$$g(\mu_i) = \beta_0 + f_1(x_i) + f_2(z_i) + f_3(v_i) + f_4(w_i) + f_5^{*}(x_i, z_i) + f_6^{*}(x_i, v_i) + f_7^{*}(z_i, v_i) + f_8^{*}(z_i, w_i)$$
where the main effects of each covariate are now separated out from the interaction smooths.
How one approaches the modelling of a specific data set will depend on the specifics of the problem being tackled. Even if you are not explicitly interested in testing the interaction effect, once you start estimating multiple bivariate (or higher) smooth functions in moderately complex settings, the decomposed version of the model can become necessary to avoid computational issues.
Note that the decomposed model is not a free lunch; the `te(x, z)` smooth would require the selection of two smoothness parameters, whereas the decomposed version `s(x) + s(z) + ti(x, z)` would require the selection of four smoothness parameters: one each for the two marginal smooths plus two for the tensor product interaction. Hence, if you don't need the decomposition, you can fit a simpler model by using the full tensor product rather than the decomposed parameterization.

Q2
The interpretation doesn't appear to be quite so simple as $f_1(x_i)$ being the smooth effect of $\mathbf{x}$ averaged over the values of $\mathbf{z}$ in the decomposed model.
The figure below shows estimates of $f_1(x_i)$ where the true relationship with $\mathbf{y}$ involves a bivariate effect of $\mathbf{x}$ and $\mathbf{z}$ (full code is shown below). Here $f_1(x_i)$ is `s(x)` in all three models. In the first we ignore the effect of `z`, in the second we include the additive effect of `z` but not the interaction, while in the third model we include the main effects and interaction of `x` and `z` on `y`.

If $f_1(x_i)$ were simply the smooth effect of `x` averaged over `z`, then we would expect all three estimated functions to be effectively the same. We do see that the smooth functions of `x` in models 1 and 2 are effectively equivalent, but the smooth of `x` in model 3 is noticeably smoother than those of the other two models. The differences between the three estimated functions are quite small indeed, and almost indistinguishable when one plots them with the data.

In hindsight we might have expected $f_1(x_i)$ to be smoother in model 3 than in either of models 1 and 2, because some of the variation in `x` is accounted for by the variation in `z` and the bivariate smooth relationship between the two variables.

Another way to think about this is to consider what the average function of `x`, $\bar{f}_1(x_i)$, might be over all possible values of $z_i$. My maths skills aren't good enough to work this out analytically, but some manipulation of the `m3` model fitted above and some averaging suggest that $f_1(x_i)$ is close to, but not the same as, $\bar{f}_1(x_i)$ (estimated from fits to the data below). What I'm showing here is:

- an estimate of $\bar{f}_1(x_i)$, constructed by averaging, over `z`, the fitted values of the smooth at each unique value of `x`. In other words, it is the average of the rainbow-coloured lines (slices through the surface) at each unique point on `x` (but for all 250 of them, not just the 10 shown), and
- the estimate of $f_1(x_i)$ from `m3`, the decomposed model (in other words, the "main effect" smooth for `x`).

The differences at the ends of the range of `x` are on the order of 0.01 to 0.03, which while not large aren't zero and don't seem to be converging towards 0 as I increased the number of locations in $z$ that I averaged the surface over.

Q3
How do we interpret the tensor product interaction smooth $f_3(x,z)$?
I think the simplest way to think about the interaction smooth is as the smooth deformations you need to apply to $f_1(x)$ as you vary $z$, or conversely to $f_2(z)$ as you vary $x$.
The plots below attempt to illustrate this. The first panel shows the additive ("main") effects $f_1(x) + f_2(z)$. The second panel shows the deformations you would apply to that surface to yield the final panel, which is the full estimated value of $y$ given the additive effects of all three smooth functions.
It is worth noting that in this example, if we estimate the bivariate effect using `te()` we get subtly different model fits, most likely due to the extra smoothness parameters in the decomposed model compared with the simpler `te()` model.

Code