From version 1.8.8 of mgcv, predict.gam has gained an exclude argument which allows terms in the model, including random effects, to be zeroed out when predicting, without the dummy trick that was suggested previously:

predict.gam and predict.bam now accept an 'exclude' argument allowing terms (e.g. random effects) to be zeroed for prediction. For efficiency, smooth terms not in terms or in exclude are no longer evaluated, and are instead set to zero or not returned. See ?predict.gam.
library("mgcv")
require("nlme")
dum <- rep(1,18)
b1 <- gam(travel ~ s(Rail, bs="re", by=dum), data=Rail, method="REML")
b2 <- gam(travel ~ s(Rail, bs="re"), data=Rail, method="REML")
head(predict(b1, newdata = cbind(Rail, dum = dum))) # ranefs on
head(predict(b1, newdata = cbind(Rail, dum = 0))) # ranefs off
head(predict(b2, newdata = Rail, exclude = "s(Rail)")) # ranefs off, no dummy
> head(predict(b1, newdata = cbind(Rail, dum = dum))) # ranefs on
1 2 3 4 5 6
54.10852 54.10852 54.10852 31.96909 31.96909 31.96909
> head(predict(b1, newdata = cbind(Rail, dum = 0))) # ranefs off
1 2 3 4 5 6
66.5 66.5 66.5 66.5 66.5 66.5
> head(predict(b2, newdata = Rail, exclude = "s(Rail)")) # ranefs off, no dummy
1 2 3 4 5 6
66.5 66.5 66.5 66.5 66.5 66.5
Older approach
Simon Wood has used the following simple example to check this is working:
library("mgcv")
require("nlme")
dum <- rep(1,18)
b <- gam(travel ~ s(Rail, bs="re", by=dum), data=Rail, method="REML")
predict(b, newdata=data.frame(Rail="1", dum=0)) ## r.e. "turned off"
predict(b, newdata=data.frame(Rail="1", dum=1)) ## prediction with r.e
Which works for me. Likewise:
dum <- rep(1, NROW(na.omit(Orthodont)))
m <- gam(distance ~ s(age, bs = "re", by = dum) + Sex, data = Orthodont)
predict(m, data.frame(age = 8, Sex = "Female", dum = 1))
predict(m, data.frame(age = 8, Sex = "Female", dum = 0))
also works.
So I would check that the data you are supplying in newdata is what you think it is, as the problem may not be with VesselID; the error is coming from the function that would have been called by the predict() calls in the examples above, and Rail is a factor in the first example.
mgcv uses a thin plate spline basis as the default basis for its smooth terms. To be honest, it likely makes little difference in many applications which of these you choose, though in some situations, or with very large data sets, other basis types might be used to good effect. Thin plate splines tend to have better RMSE performance than the other three you mention, but are more computationally expensive to set up. Unless you have a specific reason to use the P- or B-spline bases, use thin plate splines; if you have a lot of data, consider the cubic regression spline basis instead, as it is cheaper to set up.
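As a sketch of how the basis choice is specified (the simulated data and model names here are my own, just for illustration):

```r
library("mgcv")

## simulated example data shipped with mgcv (columns y, x0..x3)
set.seed(1)
df <- gamSim(1, n = 400)

## same model, two different bases for the smooth of x2
m_tp <- gam(y ~ s(x2, bs = "tp"), data = df, method = "REML")  # thin plate (the default)
m_cr <- gam(y ~ s(x2, bs = "cr"), data = df, method = "REML")  # cubic regression spline

AIC(m_tp, m_cr)  # the fits are typically very similar
```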
k doesn't set the number of knots, at least not in the default thin plate spline basis. What k does is set the dimension of the basis expansion; you'll end up with k - 1 basis functions. In mgcv, Simon Wood uses a trick to reduce the rank of the basis. IIRC, in the usual thin plate spline basis there is a knot at each data location, but this is wasteful: once you've set up this large basis, you end up using far fewer degrees of freedom in the fitted function. What Simon does is eigendecompose the matrix of basis functions and retain the eigenvectors corresponding to the k - 1 largest eigenvalues. This has the effect of concentrating the main wiggliness "information" of the full basis in a reduced-rank form.
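A quick way to see that k sets the basis dimension rather than a knot count (a sketch with simulated data; with an intercept plus k - 1 basis functions, the model has k coefficients in total):

```r
library("mgcv")

set.seed(1)
df <- data.frame(x = runif(200))
df$y <- sin(2 * pi * df$x) + rnorm(200, sd = 0.3)

m <- gam(y ~ s(x, k = 10), data = df, method = "REML")
length(coef(m))  # 10: intercept + 9 (= k - 1) basis functions
summary(m)$edf   # effective degrees of freedom actually used; typically far fewer
```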
The choice of k is important; the default is arbitrary and something you want to check (see gam.check()). The critical observation is that you want to set k large enough to contain the envisioned dimensionality of the underlying function you are trying to recover from the data. In practice, one tends to fit with a modest k given the data set size and then use gam.check() on the resulting model to check whether k was large enough. If it wasn't, increase k and refit. Rinse and repeat...
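That check-and-refit loop might look like the following (a sketch; the simulated data and object names are illustrative):

```r
library("mgcv")

set.seed(42)
df <- data.frame(x = runif(500))
df$y <- sin(6 * pi * df$x) + rnorm(500, sd = 0.2)  # quite a wiggly truth

m <- gam(y ~ s(x, k = 5), data = df, method = "REML")
gam.check(m)   # a low p-value with edf close to k' suggests k is too small

m2 <- gam(y ~ s(x, k = 20), data = df, method = "REML")
gam.check(m2)  # repeat until the basis dimension check looks OK
```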
You are most likely going to want to fit the model using REML (or ML) smoothness selection via method = "REML" or method = "ML": this treats the model as a mixed effects one, with the wiggly parts of the spline bases treated as special random effects terms. Simon Wood has shown that REML (or ML) selection performs better than GCV, which can undersmooth in situations where the objective function is flat around the optimal smoothness parameter value. The ridge penalty mentioned by @generic_user is taken care of for you, so you can ignore this part of setting up the model.
Best Answer
The glht() function for generalized linear hypotheses from the multcomp package can be used to carry out various kinds of contrasts using a range of different p-value adjustments. The contrasts you are looking for are also called "Tukey" contrasts for all pairwise comparisons. The p-value adjustments include single-step, Shaffer, Westfall, and all p.adjust methods; see ?summary.glht.

As @GavinSimpson pointed out in the comments: for gam() objects from mgcv this does not work out of the box but requires some manual intervention. For lmer() from lme4 everything works conveniently. I illustrate below how both packages can be used with multcomp to obtain equivalent results. For illustration I use the sleepstudy data from lme4, but collapse the numeric regressor Days to a three-level factor (merely for illustration purposes):
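The setup code from the original answer isn't reproduced above; a sketch of that step and the lmer() fit might look like this (the cut points used to collapse Days are my own illustrative assumption, not necessarily those of the original answer):

```r
library("lme4")
library("multcomp")

data("sleepstudy", package = "lme4")
## collapse the numeric Days regressor to a three-level factor
## (breakpoints are illustrative assumptions)
sleepstudy$Days <- cut(sleepstudy$Days, breaks = c(-Inf, 2.5, 5.5, Inf),
                       labels = c("low", "med", "high"))

m1 <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
summary(glht(m1, linfct = mcp(Days = "Tukey")))  # all pairwise comparisons
```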
glht() can be used to set up all pairwise (aka Tukey) contrasts for the Days factor. The summary() method then applies the p-value adjustment (single-step, by default).

The same model can be fitted with gam() as described in the question. However, the mcp(Days = "Tukey") method for describing the Tukey contrasts does not cooperate with gam() output and hence fails. It is not difficult, though (albeit a bit technical and tedious), to set up the contrast matrix by hand.
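The gam() fit and the hand-built contrast matrix aren't shown above; a sketch of both steps follows. The coefficient layout assumed here — intercept, Daysmed, Dayshigh, then the s(Subject) random effect coefficients — is an assumption you should verify with names(coef(m2)), and the cut points for Days are illustrative:

```r
library("mgcv")
library("multcomp")
library("lme4")  # only for the sleepstudy data

data("sleepstudy", package = "lme4")
sleepstudy$Days <- cut(sleepstudy$Days, breaks = c(-Inf, 2.5, 5.5, Inf),
                       labels = c("low", "med", "high"))  # illustrative cut points

m2 <- gam(Reaction ~ Days + s(Subject, bs = "re"),
          data = sleepstudy, method = "REML")

## contrasts on the Days fixed effects; pad the columns for the
## random effect coefficients with zeros
K0 <- rbind("med - low"  = c(0,  1, 0),
            "high - low" = c(0,  0, 1),
            "high - med" = c(0, -1, 1))
K  <- cbind(K0, matrix(0, nrow = 3, ncol = length(coef(m2)) - 3))
summary(glht(m2, linfct = K))
```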
The first columns of the contrast matrix show what is needed here: as the low coefficient is constrained to zero in the model, med - low is simply med, and analogously for high - low. The last row then shows the contrast for high - med. With this contrast matrix we can conduct the pairwise comparisons with glht().
Another quite convenient way to indicate the contrasts to be tested is through character strings. These can set up linear functions based on the effect names from names(coef(m2)). For factors with fewer levels (and hence fewer Tukey contrasts) this works quite nicely, but if the comparisons become more complex it's possibly easier to construct the contrast matrix as above.
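A sketch of that character-string interface (the coefficient names Daysmed and Dayshigh are assumptions — check names(coef(m2)) — and the parsing may need adjustment if some coefficient names are not syntactically valid R names):

```r
library("multcomp")
## assumes m2 from the gam() fit above; each string is a linear
## function of the named coefficients, tested against zero
summary(glht(m2, linfct = c("Daysmed = 0",
                            "Dayshigh = 0",
                            "Dayshigh - Daysmed = 0")))
```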