Solved – In R package mgcv, is it valid to have a random effect smooth on two continuous variables

mgcvr

In my model, I have two variables, distance and time, that influence the performance of different subjects (success or failure). The relationship between distance and time is nonlinear, not separable into single additive terms, and (probably) vary by subject. There are other variables involved in other terms in the model, but my focus at the moment is these two variables.

I don't have a parametric form that I can expect, but I do have thousands of observations across hundreds of subjects, and so can estimate the overall interaction between distance and time nonparametrically. I have been using mgcv::bam (the large-data version of mgcv::gam) to estimate a (simplified) model along the lines of:

mgcv::bam(success ~ s(distance, time), data=mydata)

However, what I'd like to explore is how this distance + time relationship varies across subjects, using a random effects framework. Something like:

mgcv::bam(success ~ s(distance, time, subject, bs="re"), data=mydata)

Running R code this way doesn't produce an error, but I am not sure it's doing what I think it is. I have not found any references to using multiple numeric variables in a mgcv random effect smooth this way, and wonder if there's a reason for that.

As a side note, one approach might be to combine distance and time into a single variable like "speed" (distance / time), and then use:

mgcv::bam(success ~ s(speed, subject, bs="re"), data=mydata)

… but this potentially loses some of the details in the complex way distance and time interact, leading to my question.

Consulting the documentation for the mgcv::gam function in R, you can specify a smooth on a factor variable to be a random slope of another numeric variable.

If g is a factor and x is numeric, then s(x,g,bs="re") produces an i.i.d. normal random slope relating the response to x for each level of g.

It's unclear to me whether you can specify more than one numeric variable in conjunction with a random effect smooth. i.e. s(x, y, g, bs="re"). Or even if this is a sensible think to ask.

Simplified example:

## adapted from example in mgcv docs
dat <- gamSim(1,n=400,scale=2) ## simulate 4 term additive truth

fac <- sample(1:20,400,replace=TRUE)
b <- rnorm(20)*.5
dat$y <- dat$y + b[fac]
dat$fac <- as.factor(fac)

rm1 <- gam(y ~ s(x0, fac,bs="re") + s(x1, x2, fac, bs="re") +    s(x3),data=dat,method="ML")

The smooth on x0 & fac is ok per the documentation, but I don't know if the smooth on x1, x2, & fac is valid and interpretable.

Best Answer

Huh... I made my post as a guest on SO because I am still on a suspension, but then the question got migrated here!

So, if I understand you correctly, there's not really any similarity between the smooth s(x1, x2) and the random effect s(x1, x2, fac, bs = "re"), correct?

Correct. The function name "s" does not mean "smooth function" when s() is used to construct a random effect. Broadly speaking, s() is just a model term constructor routine that constructs a design matrix and a penalty matrix.

What I was envisioning was something smoothing in 2 dimensions like the former, but with some deviations from the average by factor level. You can get separate smooths per factor level using s(x1, x2, by=fac), but that completely separates the data for each factor level, rather than doing some partial pooling.

s(x1, x2, by = fac) gives you something pretty close to what you want, except that as you said, data from different factor levels are treated independently. Technically, "close" means that s(x1, x2, by = fac) gives you the correct design matrix but not the correct penalty matrix. In this regard, you are probably aiming at te(x1, x2, fac, d = c(2, 1), bs = c("tp", "re")). I have never seen such model term before, but its construction is definitely possible in mgcv:

library(mgcv)

x1 <- runif(1000)
x2 <- runif(1000)
f <- gl(5, 200)

## "smooth.spec" object
smooth_spec <- te(x1, x2, f, d = c(2, 1), bs = c("tp", "re"))

## "smooth" object
sm <- smooth.construct(smooth_spec,
                       data = list(x1 = x1, x2 = x2, f = f),
                       knots = NULL)

You can check that this smooth term has 2 smoothing parameters as expected, one for the s(x1, x2, bs = 'tp') margin, the other for the s(f, bs = 're') margin.

Specification of k turns out subtle. You need to explicitly pass nlevels(f) to the random effect margin. For example, if you want a rank-10 thin-plate regression spline,

## my example factor `f` has 5 levels
smooth_spec <- te(x1, x2, f, d = c(2, 1), bs = c("tp", "re"), k = c(10, 5))
sapply(smooth_spec$margin, "[[", "bs.dim")
# [1] 10  5

At first I was thinking that perhaps we can simply pass NA to the random effect margin, but it turns out not!

smooth_spec <- te(x1, x2, f, d = c(2, 1), bs = c("tp", "re"), k = c(10, NA))
sapply(smooth_spec$margin, "[[", "bs.dim")
# [1] 25  5  ## ?? why is it 25? something has gone wrong!

This might implies that there is a tiny bug... will have a check when available.