Solved – In R package mgcv, is it valid to have a random effect smooth on two continuous variables

mgcvr

In my model, I have two variables, distance and time, that influence the performance of different subjects (success or failure). The relationship between distance and time is nonlinear, not separable into single additive terms, and (probably) vary by subject. There are other variables involved in other terms in the model, but my focus at the moment is these two variables.

I don't have a parametric form that I can expect, but I do have thousands of observations across hundreds of subjects, and so can estimate the overall interaction between distance and time nonparametrically. I have been using mgcv::bam (the large-data version of mgcv::gam) to estimate a (simplified) model along the lines of:

mgcv::bam(success ~ s(distance, time), data=mydata)

However, what I'd like to explore is how this distance + time relationship varies across subjects, using a random effects framework. Something like:

mgcv::bam(success ~ s(distance, time, subject, bs="re"), data=mydata)

Running R code this way doesn't produce an error, but I am not sure it's doing what I think it is. I have not found any references to using multiple numeric variables in a mgcv random effect smooth this way, and wonder if there's a reason for that.

As a side note, one approach might be to combine distance and time into a single variable like "speed" (distance / time), and then use:

mgcv::bam(success ~ s(speed, subject, bs="re"), data=mydata)

… but this potentially loses some of the details in the complex way distance and time interact, leading to my question.

Consulting the documentation for the mgcv::gam function in R, you can specify a smooth on a factor variable to be a random slope of another numeric variable.

If g is a factor and x is numeric, then s(x,g,bs="re") produces an i.i.d. normal random slope relating the response to x for each level of g.

It's unclear to me whether you can specify more than one numeric variable in conjunction with a random effect smooth. i.e. s(x, y, g, bs="re"). Or even if this is a sensible think to ask.

Simplified example:

## adapted from example in mgcv docs
dat <- gamSim(1,n=400,scale=2) ## simulate 4 term additive truth

fac <- sample(1:20,400,replace=TRUE)
b <- rnorm(20)*.5
dat$y <- dat$y + b[fac]
dat$fac <- as.factor(fac)

rm1 <- gam(y ~ s(x0, fac,bs="re") + s(x1, x2, fac, bs="re") +    s(x3),data=dat,method="ML")

The smooth on x0 & fac is ok per the documentation, but I don't know if the smooth on x1, x2, & fac is valid and interpretable.

Best Answer

Huh... I made my post as a guest on SO because I am still on a suspension, but then the question got migrated here!

So, if I understand you correctly, there's not really any similarity between the smooth s(x1, x2) and the random effect s(x1, x2, fac, bs = "re"), correct?

Correct. The function name "s" does not mean "smooth function" when s() is used to construct a random effect. Broadly speaking, s() is just a model term constructor routine that constructs a design matrix and a penalty matrix.

What I was envisioning was something smoothing in 2 dimensions like the former, but with some deviations from the average by factor level. You can get separate smooths per factor level using s(x1, x2, by=fac), but that completely separates the data for each factor level, rather than doing some partial pooling.

s(x1, x2, by = fac) gives you something pretty close to what you want, except that as you said, data from different factor levels are treated independently. Technically, "close" means that s(x1, x2, by = fac) gives you the correct design matrix but not the correct penalty matrix. In this regard, you are probably aiming at te(x1, x2, fac, d = c(2, 1), bs = c("tp", "re")). I have never seen such model term before, but its construction is definitely possible in mgcv:

library(mgcv)

x1 <- runif(1000)
x2 <- runif(1000)
f <- gl(5, 200)

## "smooth.spec" object
smooth_spec <- te(x1, x2, f, d = c(2, 1), bs = c("tp", "re"))

## "smooth" object
sm <- smooth.construct(smooth_spec,
                       data = list(x1 = x1, x2 = x2, f = f),
                       knots = NULL)

You can check that this smooth term has 2 smoothing parameters as expected, one for the s(x1, x2, bs = 'tp') margin, the other for the s(f, bs = 're') margin.

Specification of k turns out subtle. You need to explicitly pass nlevels(f) to the random effect margin. For example, if you want a rank-10 thin-plate regression spline,

## my example factor `f` has 5 levels
smooth_spec <- te(x1, x2, f, d = c(2, 1), bs = c("tp", "re"), k = c(10, 5))
sapply(smooth_spec$margin, "[[", "bs.dim")
# [1] 10  5

At first I was thinking that perhaps we can simply pass NA to the random effect margin, but it turns out not!

smooth_spec <- te(x1, x2, f, d = c(2, 1), bs = c("tp", "re"), k = c(10, NA))
sapply(smooth_spec$margin, "[[", "bs.dim")
# [1] 25  5  ## ?? why is it 25? something has gone wrong!

This might implies that there is a tiny bug... will have a check when available.

Related Solutions

Solved – Centering constraints in MGCV GAM

What that means is that each smooth created using a by term will be centred at zero. In effect this means that the data for each level of by has the same mean (i.e. the intercept or constant term in the model). What Simon Wood means is that if the data have different mean values (Site A has higher values than Site B say and by = Site) then because of the centring constraints this difference in the means of the different groups will not be taken into account in the model.

Using an example from ?gam.model I will demonstrate the difference:

## Factor `by' variable example (with a spurious covariate x0)
## simulate data...

dat <- gamSim(4)

## fit model...
b <- gam(y ~ fac + s(x2, by=fac) + s(x0), data=dat)
bb <- gam(y ~ s(x2, by=fac) + s(x0), data=dat) ## without fac
summary(b)
summary(bb)

Note how in model b we include the fac variable as a separate parametric term in the model formula as well as in the s( ) part. The intercept in this model will then represent the mean for the first group (level) of fac and the coefficients for fac in the model (and in the summary output) will represent the difference in means from the first group of the other groups.

Look closely at the output from summary(), notice how the deviance explained by b is much higher than that of bb (without the parametric fac term), and consequently b has a lower GCV score (lower is better) and a lower scale parameter. The higher scale parameter in bb is due to there being a large amount of unexplained variance (the between group effect of fac) left over in the simpler model.

R> summary(b)
....
R-sq.(adj) =  0.644   Deviance explained = 65.6%
GCV score = 3.9013  Scale est. = 3.7517    n = 400

R> summary(bb)
....
R-sq.(adj) =  0.451   Deviance explained = 47.3%
GCV score = 6.0322  Scale est. = 5.7758    n = 400

Generalized Additive Models – Predicting with Random Effects in mgcv gam

From version 1.8.8 of mgcv predict.gam has gained an exclude argument which allows for the zeroing out of terms in the model, including random effects, when predicting, without the dummy trick that was suggested previously.

predict.gam and predict.bam now accept an 'exclude' argument allowing terms (e.g. random effects) to be zeroed for prediction. For efficiency, smooth terms not in terms or in exclude are no longer evaluated, and are instead set to zero or not returned. See ?predict.gam.

library("mgcv")
require("nlme")
dum <- rep(1,18)
b1 <- gam(travel ~ s(Rail, bs="re", by=dum), data=Rail, method="REML")
b2 <- gam(travel ~ s(Rail, bs="re"), data=Rail, method="REML")

head(predict(b1, newdata = cbind(Rail, dum = dum)))    # ranefs on
head(predict(b1, newdata = cbind(Rail, dum = 0)))      # ranefs off
head(predict(b2, newdata = Rail, exclude = "s(Rail)")) # ranefs off, no dummy

> head(predict(b1, newdata = cbind(Rail, dum = dum)))    # ranefs on
       1        2        3        4        5        6 
54.10852 54.10852 54.10852 31.96909 31.96909 31.96909  
> head(predict(b1, newdata = cbind(Rail, dum = 0)))      # ranefs off
   1    2    3    4    5    6 
66.5 66.5 66.5 66.5 66.5 66.5
> head(predict(b2, newdata = Rail, exclude = "s(Rail)")) # ranefs off, no dummy
   1    2    3    4    5    6 
66.5 66.5 66.5 66.5 66.5 66.5

Older approach

Simon Wood has used the following simple example to check this is working:

library("mgcv")
require("nlme")
dum <- rep(1,18)
b <- gam(travel ~ s(Rail, bs="re", by=dum), data=Rail, method="REML")
predict(b, newdata=data.frame(Rail="1", dum=0)) ## r.e. "turned off"
predict(b, newdata=data.frame(Rail="1", dum=1)) ## prediction with r.e

Which works for me. Likewise:

dum <- rep(1, NROW(na.omit(Orthodont)))
m <- gam(distance ~ s(age, bs = "re", by = dum) + Sex, data = Orthodont)
predict(m, data.frame(age = 8, Sex = "Female", dum = 1))
predict(m, data.frame(age = 8, Sex = "Female", dum = 0))

also works.

So I would check the data you are supplying in newdata is what you think it is as the problem may not be with VesselID — the error is coming from the function that would have been called by the predict() calls in the examples above, and Rail is a factor in the first example.