Q1 What's the difference between models 3 and 4?
Model 3 is a purely additive model
$$y = \alpha + f_1(x) + f_2(w) + \varepsilon$$
so we have a constant $\alpha$ plus the smooth effect of $x$ plus the smooth effect of $w$.
Model 4 is a smooth interaction of two continuous variables
$$y = \alpha + f_1(x, w) + \varepsilon$$
In a practical sense, Model 3 says that no matter what the effect of $w$ is, the effect of $x$ on the response is the same; if we fix $x$ at some known value and vary $w$ over some range, the contributions from $f_1(x)$ to the fitted model remain the same. You can verify this by predict()-ing from model 3 for a fixed value of x and a few different values of w, using the type = 'terms' argument of the predict() method. You'll see a constant contribution to the fitted/predicted values for s(x).
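As a sketch of that check (assuming model 3 has been fitted as m3 to the data.ex object used below; the object names and the value chosen for x are assumptions):

library(mgcv)
# purely additive model: y = alpha + f1(x) + f2(w) + error
m3 <- gam(y ~ s(x) + s(w), data = data.ex, method = 'REML')
# fix x at one value, vary w, and look at the per-term contributions
newd <- data.frame(x = 0.5, w = seq(min(data.ex$w), max(data.ex$w), length.out = 5))
predict(m3, newdata = newd, type = 'terms')
# the s(x) column is constant down the rows; only s(w) changes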
This is not the case for model 4; that model says that the smooth effect of $x$ varies smoothly with the value of $w$, and vice versa.
Note that unless $x$ and $w$ are in the same units or we expect the same wiggliness in both variables, you should be using te() to fit the interaction.
library(mgcv)     # gam(), te()
library(dplyr)    # mutate()
library(ggplot2)
# fit the smooth interaction of x and w, then overlay fitted values on the data
m4a <- gam(y ~ te(x, w), data = data.ex, method = 'REML')
pdata <- mutate(data.ex, Fittedm4a = predict(m4a))
ggplot(pdata, aes(x = x, y = y)) +
  geom_point() +
  geom_line(aes(y = Fittedm4a), col = 'red')
In one sense, model 4 is fitting
$$y = \alpha + f_1(x) + f_2(w) + f_3(x, w) + \varepsilon$$
where $f_3$ is the pure smooth interaction of $x$ and $w$; the "main" smooth effects of $x$ and $w$ have been removed, for the sake of identifiability, from the basis of $f_3$. You can get this model via
m4b <- gam(y ~ ti(x) + ti(w) + ti(x, w), data = data.ex, method = 'REML')
but note this estimates 4 smoothness parameters:
- the one associated with the main smooth effect of $x$
- the one associated with the main smooth effect of $w$
- the one associated with the marginal smooth of $x$ in the interaction tensor product smooth
- the one associated with the marginal smooth of $w$ in the interaction tensor product smooth
The te() model contains just two smoothness parameters, one per marginal basis.
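If you want to verify this, a quick check (assuming m4a and m4b have been fitted as above) is to count the smoothness parameters stored in each fitted model:

length(m4a$sp)   # 2: one per marginal basis of te(x, w)
length(m4b$sp)   # 4: ti(x), ti(w), plus the two marginals of ti(x, w)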
A fundamental problem with all these models is that the effect of $w$ is not strictly smooth; there's a discontinuity where the effect of $w$ falls to 0 (or 1 in w2). This is showing up in your plots (and the one I show in detail here).
Q2 What, exactly, is "by" doing here?
by variable smooths can do a number of different things depending on what you pass to the by argument. In your examples the by variable, $w$, is continuous. In that case what you get is a varying coefficient model: a model in which the linear effect of $w$ varies smoothly with $x$. In equation terms, this is what, say, your model 5 is doing:
$$y = \alpha + f_1(x)w + \varepsilon$$
If this isn't immediately clear (it wasn't to me when I first looked at these models): for a given value of $x$ we evaluate the smooth function at that value, and this then becomes the equivalent of $\beta_1 w$; in other words, it is the linear effect of $w$ at the given value of $x$, and those linear effects vary smoothly with $x$. See section 7.5.3 in the second edition of Simon's book for a concrete example where the linear effect of a covariate varies as a smooth function of space (lat and long).
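As a minimal sketch of such a varying coefficient model in mgcv (reusing the data.ex object from above; the object name m5 is just an assumed label for your model 5):

# continuous 'by' variable: the linear effect of w varies smoothly with x
m5 <- gam(y ~ s(x, by = w), data = data.ex, method = 'REML')
summary(m5)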
Q3 Why would there be such a notable difference between Models 5 and 9?
The difference between models 5 and 9 is, I think, simply due to multiplying by 0 or multiplying by 1. With the former, the effect of the only term in the model, $f_1(x)w$, is 0 because $f_1(x) \times 0 = 0$. In model 9 you have $f_1(x) \times 1 = f_1(x)$ in those areas where there is no contribution from the Gaussians of $w$. As $f_1(x)$ is an approximately exponential function, you get this superimposed upon the overall effect of $w$.
In other words, model 5 contains zero trend everywhere $w$ is 0, but model 9 includes an approximately exponential trend everywhere $w$ is 0 (or 1 in w2), upon which the varying coefficient effect of $w$ is superimposed.
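To see the multiply-by-zero point concretely, a small sketch (assuming the m5 fit above and that some rows of data.ex have w equal to 0):

# the contribution of the s(x):w term to the linear predictor is f1(x) * w,
# so it is exactly 0 wherever w is 0
contrib <- predict(m5, type = 'terms')[, 's(x):w']
head(cbind(w = data.ex$w, contrib))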
Best Answer
The idea here is that by estimating the trend as a smooth function, the residuals are then a stationary process and the ARMA model is estimated on those residuals. In other words, the estimated smoother is detrending the data and the residuals are subjected to an ARMA model. Of course it doesn't actually happen in a two-step process like this, but hopefully it is clearer why we don't require stationarity when modelling time series using GAM(M)s.
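As a sketch of what this looks like in practice with mgcv (the response y, covariate time, and data frame df here are assumptions, not objects from the question):

library(mgcv)   # also loads nlme, which provides the correlation structures
# smooth trend in time, with ARMA(1,0) (i.e. AR(1)) errors for the residuals
mod <- gamm(y ~ s(time, k = 20), data = df,
            correlation = corARMA(form = ~ time, p = 1, q = 0))
summary(mod$gam)   # the estimated smooth trend
summary(mod$lme)   # includes the estimated AR parameter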
Another issue you'll potentially need to grapple with is identifiability of the trend and the correlation structure. If we have a high degree of autocorrelation in the observed series, then we could model this either as a wiggly smooth trend with essentially independent errors, or as a much simpler trend with strongly autocorrelated residuals.
These two models are very similar to one another, and absent any information you can give to separate the two (say by fixing the degrees of freedom of the spline or setting the parameters of the correlation structure in the residuals), the data may not contain enough information to uniquely identify the separate trend and autocorrelation processes.
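Those two ways of supplying extra information look roughly like this (a sketch only; the particular values of k and the fixed AR(1) coefficient are arbitrary assumptions):

# fix the spline's degrees of freedom with a fixed-df regression spline (fx = TRUE)
mod_fixdf <- gamm(y ~ s(time, k = 10, fx = TRUE), data = df,
                  correlation = corAR1(form = ~ time))
# or fix the parameters of the correlation structure instead of estimating them
mod_fixar <- gamm(y ~ s(time, k = 20), data = df,
                  correlation = corAR1(value = 0.5, form = ~ time, fixed = TRUE))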
The whole reason behind my use of GAMs in my research is that I'm interested in estimating the trends in my time series, and those trends are in general non-linear. The GAM allows me to estimate the thing I want. In classical time series modelling, the interest is in modelling the data as stochastic trends using lagged versions of the response and/or current and lagged versions of a white noise process. This is of less interest in my work, but is clearly of broad interest to others.
I've written a paper on modelling time series using GAMs, which hopefully explains some of the approach. It's written for palaeoecologists but applies to any univariate time series; it's Open Access, so free to read, and supported by full R code in the Supplements.