The following question builds on the discussion found on this page.
Given a response variable y
, a continuous explanatory variable x
and a factor fac
, it is possible to define a General Additive Model (GAM) with an interaction between x
and fac
using the argument by=
. According to the help file ?gam.models
in the R package mgcv
, this can be accomplished as follows:
gam1 <- gam(y ~ fac +s(x, by = fac), ...)
@GavinSimpson here suggests a different approach:
gam2 <- gam(y ~ fac +s(x) +s(x, by = fac, m=1), ...)
I have been playing around with a third model:
gam3 <- gam(y ~ s(x, by = fac), ...)
My main questions are: are some of these models just wrong, or are they simply different? In the latter case, what are their differences?
Based on the example that I am going to discuss below I think I could understand some of their differences, but I am still missing something.
As an example I am going to use a dataset with color spectra for flowers of two different plant species measured at different locations.
rm(list=ls())
# install.packages("RCurl")
library(RCurl) # allows accessing data from URL
df <- read.delim(text=getURL("https://raw.githubusercontent.com/marcoplebani85/datasets/master/flower_color_spectra.txt"))
library(mgcv)
These are the mean color spectra at the locality level for the two species (rolling means were used):
Each color refers to a different species. Each line refers to a different locality.
My final goal is to model the (potentially interactive) effect of Taxon
and wavelength wl
on % reflectance (referred to as density
in the code and dataset) while accounting for Locality
as a random effect in a mixed-effect GAM. For the moment I won't add the mixed effect part to my plate, which is already full enough with trying to understand how to model interactions.
I'll start with the simplest of the three interactive GAMs:
gam.interaction0 <- gam(density ~ s(wl, by = Taxon), data = df)
# common intercept, different slopes
plot(gam.interaction0, pages=1)
summary(gam.interaction0)
Produces:
Family: gaussian
Link function: identity
Formula:
density ~ s(wl, by = Taxon)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.3490 0.1693 167.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(wl):TaxonSpeciesA 8.938 8.999 884.3 <2e-16 ***
s(wl):TaxonSpeciesB 8.838 8.992 325.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.523 Deviance explained = 52.4%
GCV = 284.96 Scale est. = 284.42 n = 9918
The parametric part is the same for both species, but different splines are fitted for each species. It is a bit confusing to have a parametric part in the summary of GAMs, which are non-parametric. @IsabellaGhement explains:
If you look at the plots of the estimated smooth effects (smooths)
corresponding to your first model, you will notice that they are
centered about zero. So you need to 'shift' those smooths up (if the
estimated intercept is positive) or down (if the estimated intercept
is negative) to obtain the smooth functions you thought you were
estimating. In other words, you need to add the estimated intercept to
the smooths to get at what you really want. For your first model, the
'shift' is assumed to be the same for both smooths.
Moving on:
gam.interaction1 <- gam(density ~ Taxon +s(wl, by = Taxon, m=1), data = df)
plot(gam.interaction1,pages=1)
summary(gam.interaction1)
Gives:
Family: gaussian
Link function: identity
Formula:
density ~ Taxon + s(wl, by = Taxon, m = 1)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.3132 0.1482 272.0 <2e-16 ***
TaxonSpeciesB -26.0221 0.2186 -119.1 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(wl):TaxonSpeciesA 7.978 8 2390 <2e-16 ***
s(wl):TaxonSpeciesB 7.965 8 879 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.803 Deviance explained = 80.3%
GCV = 117.89 Scale est. = 117.68 n = 9918
Now, each species also have its own parametric estimate.
The next model is the one that I have trouble understanding:
gam.interaction2 <- gam(density ~ Taxon + s(wl) + s(wl, by = Taxon, m=1), data = df)
plot(gam.interaction2, pages=1)
I have no clear idea of what these graphs represent.
summary(gam.interaction2)
Gives:
Family: gaussian
Link function: identity
Formula:
density ~ Taxon + s(wl) + s(wl, by = Taxon, m = 1)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.3132 0.1463 275.6 <2e-16 ***
TaxonSpeciesB -26.0221 0.2157 -120.6 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(wl) 8.940 8.994 30.06 <2e-16 ***
s(wl):TaxonSpeciesA 8.001 8.000 11.61 <2e-16 ***
s(wl):TaxonSpeciesB 8.001 8.000 19.59 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.808 Deviance explained = 80.8%
GCV = 114.96 Scale est. = 114.65 n = 9918
The parametric part of gam.interaction2
is about the same as for gam.interaction1
, but now there are three estimates for smooth terms, which I cannot interpret.
Thanks in advance to anyone who will take the time to help me understand the differences in the three models.
Best Answer
gam1
andgam2
are fine; they are different models, although they are trying to do the same thing, which is model group-specific smooths.The
gam1
formdoes this by estimating a separate smoother for each level of
f
(assuming thatf
is a standard factor), and indeed, a separate smoothness parameter is estimated for each smooth also.The
gam2
formachieves the same aim as
gam1
(of modelling the smooth relationship betweenx
andy
for each level off
) but it does so by estimating a global or average smooth effect ofx
ony
(thes(x)
term) plus a smooth difference term (the seconds(x, by = f, m = 1)
term). As the penalty here is on the first derivative (m = 1) for this difference smoother, it is penalising departure from a flat line, which when added to the global or average smooth term (
s(x)`) reflects a deviation from the global or average effect.gam3
formis wrong regardless of how well it may fit in a particular situation. The reason I say it is wrong is that each smooth specified by the
s(x, by = f)
part is centred about zero because of the sum-to-zero constraint imposed for model identifiability. As such, there is nothing in the model that accounts for the mean of $Y$ in each of the groups defined byf
. There is only the overall mean given by the model intercept. This means that smoother, which is centred about zero and which has had the flat basis function removed from the basis expansion ofx
(as it is confounded with the model intercept) is now responsible for modelling both the difference in the mean of $Y$ for the current group and the overall mean (model intercept), plus the smooth effect ofx
on $Y$.None of these models is appropriate for your data however; ignoring for now the wrong distribution for the response (
density
can't be negative and there is a heterogeneity issue which a non-Gaussianfamily
would fix or address), you haven't taken into account the grouping by flower (SampleID
in your dataset).If your aim is to model
Taxon
specific curves then a model of the form would be a starting point:where I have added a random effect for
SampleID
and boosted the size of the basis expansion for theTaxon
specific smooths.This model,
m1
, models the observations as coming from either a smoothwl
effect depending on which species (Taxon
) the observation comes from (theTaxon
parametric term just sets the meandensity
for each species and is needed as discussed above), plus a random intercept. Taken together, the curves for individual flowers arise from shifted versions of theTaxon
specific curves, with the amount of shift given by the random intercept. This model assumes that all individuals have the same shape of smooth as given by the smooth for the particularTaxon
that individual flower comes from.Another version of this model is the
gam2
form from above but with an added random effectThis model fits better but I don't think it is solving the problem at all, see below. One thing I think it does suggest is that the default
k
is potentially too low for theTaxon
specific curves in these models. There is still a lot of residual smooth variation that we're not modelling if you look at the diagnostic plots.This model is more than likely too restrictive for your data; some of the curves in your plot of the individual smooths do not appear to be simple shifted versions of the
Taxon
average curves. A more complex model would allow for individual-specific smooths too. Such a model can estimated using thefs
or factor-smooth interaction basis. We still wantTaxon
specific curves but we also want to have a separate smooth for eachSampleID
, but unlike theby
smooths, I would suggest that initially you want all of thoseSampleID
-specific curves to have the same wiggliness. In the same sense as the random intercept that we included earlier, thefs
basis adds a random intercept, but also include a "random" spline (I use the scare quotes as in a Bayesian interpretation of the GAM, all these models are just variations on random effects).This model is fitted for your data as
Note that I have increase
k
here, in case we need more wiggliness in theTaxon
-specific smooths. We still need theTaxon
parametric effect for reasons explained above.That model takes a long time to fit on a single core with
gam()
—bam()
will most likely be better at fitting this model as there are a relatively large number of random effects here.If we compare these models with a smoothness parameter selection-corrected version of AIC we see just how dramatically better this latter model,
m3
, is compared to the other two even though it uses an order of magnitude more degrees of freedomIf we look at this model's smooths we get a better idea of how it is fitting the data:
(Note this was produced using
draw(m3)
using thedraw()
function from my gratia package. The colours in the lower-left plot are irrelevant and don't help here.)Each
SampleID
's fitted curve is built up from either the intercept or the parametric termTaxonSpeciesB
plus one of the twoTaxon
-specific smooths, depending on to whichTaxon
eachSampleID
belongs, plus it's ownSampleID
-specifc smooth.Note that all these models are still wrong as they don't account for the heterogeneity; gamma or Tweedie models with a log link would be my choices to take this further. Something like:
But I'm having trouble with this model fitting at the moment, which might indicate it is too complex with multiple smooths of
wl
included.An alternative form is to use the ordered factor approach, which does an ANOVA-like decomposition on the smooths:
Taxon
parametric term is retaineds(wl)
is a smooth that will represent the reference levels(wl, by = Taxon)
will have a separate difference smooth for each other level. In your case you'll have only one of these.This model is fitted like
m3
,but the interpretation is different; the first
s(wl)
will refer toTaxonA
and the smooth implied bys(wl, by = fTaxon)
will be a smooth difference between the smooth forTaxonA
and that ofTaxonB
.