In my model, I have two variables, distance and time, that influence the performance of different subjects (success or failure). The relationship between distance and time is nonlinear, not separable into single additive terms, and (probably) vary by subject. There are other variables involved in other terms in the model, but my focus at the moment is these two variables.
I don't have a parametric form that I can expect, but I do have thousands of observations across hundreds of subjects, and so can estimate the overall interaction between distance and time nonparametrically. I have been using mgcv::bam (the large-data version of mgcv::gam) to estimate a (simplified) model along the lines of:
mgcv::bam(success ~ s(distance, time), data=mydata)
However, what I'd like to explore is how this distance + time relationship varies across subjects, using a random effects framework. Something like:
mgcv::bam(success ~ s(distance, time, subject, bs="re"), data=mydata)
Running R code this way doesn't produce an error, but I am not sure it's doing what I think it is. I have not found any references to using multiple numeric variables in a mgcv random effect smooth this way, and wonder if there's a reason for that.
As a side note, one approach might be to combine distance and time into a single variable like "speed" (distance / time), and then use:
mgcv::bam(success ~ s(speed, subject, bs="re"), data=mydata)
… but this potentially loses some of the details in the complex way distance and time interact, leading to my question.
Consulting the documentation for the mgcv::gam function in R, you can specify a smooth on a factor variable to be a random slope of another numeric variable.
If g is a factor and x is numeric, then s(x,g,bs="re") produces an i.i.d. normal random slope relating the response to x for each level of g.
It's unclear to me whether you can specify more than one numeric variable in conjunction with a random effect smooth. i.e. s(x, y, g, bs="re"). Or even if this is a sensible think to ask.
Simplified example:
## adapted from example in mgcv docs
dat <- gamSim(1,n=400,scale=2) ## simulate 4 term additive truth
fac <- sample(1:20,400,replace=TRUE)
b <- rnorm(20)*.5
dat$y <- dat$y + b[fac]
dat$fac <- as.factor(fac)
rm1 <- gam(y ~ s(x0, fac,bs="re") + s(x1, x2, fac, bs="re") + s(x3),data=dat,method="ML")
The smooth on x0 & fac is ok per the documentation, but I don't know if the smooth on x1, x2, & fac is valid and interpretable.
Best Answer
Huh... I made my post as a guest on SO because I am still on a suspension, but then the question got migrated here!
Correct. The function name "s" does not mean "smooth function" when
s()
is used to construct a random effect. Broadly speaking,s()
is just a model term constructor routine that constructs a design matrix and a penalty matrix.s(x1, x2, by = fac)
gives you something pretty close to what you want, except that as you said, data from different factor levels are treated independently. Technically, "close" means thats(x1, x2, by = fac)
gives you the correct design matrix but not the correct penalty matrix. In this regard, you are probably aiming atte(x1, x2, fac, d = c(2, 1), bs = c("tp", "re"))
. I have never seen such model term before, but its construction is definitely possible inmgcv
:You can check that this smooth term has 2 smoothing parameters as expected, one for the
s(x1, x2, bs = 'tp')
margin, the other for thes(f, bs = 're')
margin.Specification of
k
turns out subtle. You need to explicitly passnlevels(f)
to the random effect margin. For example, if you want a rank-10 thin-plate regression spline,At first I was thinking that perhaps we can simply pass
NA
to the random effect margin, but it turns out not!This might implies that there is a tiny bug... will have a check when available.