Predicting with GAM (mgcv) and categorical/factor covariate in R

generalized-additive-model marginal-effect mgcv predictive-models r

I have some data for multiple users and I want to generate some marginal effects for each user using GAM modelling. If I do this exercise for just a single user (John) as:

model_1 = mgcv::gam(y ~ s(speed) + s(length) + s(price), data = data1) with data1 having data only for John, then I can get the speed-effect with prediction_1 = mgcv::predict.gam(model_1, test_set, type="response"), where test_set is a table with speed as the sequence seq(0, 100, by = 5), while both length and price are fixed at John's mean values. It is the same as running plot(ggeffects::ggpredict(model_1), facets = TRUE).

Now, I want to repeat this exercise for John but with a model that includes all my users. Hence, I have a categorical/factor variable user_id and run the following model:

model_2 = mgcv::gam(y ~ user_id + s(speed) + s(length) + s(price), data = data2) with data2 having all users' data. Now, if I want to obtain only John's speed-effect again, but from model_2, I try predictions_2 = mgcv::predict.gam(model_2, test_set2, type="response"), where test_set2 has only test_set2$user_id = "John", speed again as a 0-100 sequence, and both length and price fixed at John's mean values.

The results from these two exercises differ, although I expected them to be the same. I have also tried test_set2 with length and price fixed at all users' mean values, but again I cannot get results for John identical to prediction_1.

Could you please help me understand what I am doing wrong or missing here? How can I obtain prediction_1 results using the model with the categorical? Much appreciated.

Best Answer

If you want to model individual-specific effects (separate smooth effects for individuals), then you have to include terms in the model for each user.

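In your example, all you did was introduce a different mean (constant) for each person. All individuals shared the same smooth effects of speed, length, and price. John's predicted curve from model_2 is therefore the pooled curve shifted by John's constant, which is why it does not match the curve from the John-only model_1.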
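To make that concrete, here is a minimal sketch on simulated data (the data frame `dat`, the users, and the helper `grid()` are hypothetical, made up for illustration only). It fits a model with the same structure as model_2 and shows that, on the link scale, the predicted speed curves for two users differ by a single constant.

```r
library(mgcv)

set.seed(1)
# Hypothetical simulated data: three users with different true speed effects
n <- 200
users <- c("John", "Mary", "Alex")
dat <- data.frame(
  user_id = factor(rep(users, each = n), levels = users),
  speed   = runif(3 * n, 0, 100),
  length  = runif(3 * n, 0, 10),
  price   = runif(3 * n, 0, 50)
)
slope <- c(John = 0.05, Mary = -0.02, Alex = 0.10)
dat$y <- unname(slope[as.character(dat$user_id)]) * dat$speed +
  sin(dat$length) + 0.01 * dat$price + rnorm(3 * n, sd = 0.5)

# Same structure as model_2: user_id only shifts the intercept
m2 <- gam(y ~ user_id + s(speed) + s(length) + s(price),
          data = dat, method = "REML")

# Speed grid for one user, other covariates held fixed (hypothetical helper)
grid <- function(user) {
  data.frame(user_id = factor(user, levels = levels(dat$user_id)),
             speed   = seq(0, 100, by = 5),
             length  = mean(dat$length),
             price   = mean(dat$price))
}
p_john <- predict(m2, newdata = grid("John"), type = "link")
p_mary <- predict(m2, newdata = grid("Mary"), type = "link")

# The gap between the two curves is the same at every speed value
diff_jm <- as.numeric(p_john - p_mary)
range(diff_jm)
```

Even though the simulated users have different true speed effects, this model structure cannot represent that: the fitted curves have identical shapes and differ only in level.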

The HGAM paper mentioned by @Roland in the comments discusses modelling options for group and individual level effects. You basically have two options:

  1. Do you want a common, shared effect, plus separate smooths for each individual, or
  2. Only separate smooths for each individual

The end result should be similar in terms of fit, but differ in terms of how they decompose the effects. Which you use will depend on your questions and the specific nature of the system you are studying.

Within those options, you have 2 further choices. Do you want

  1. random smooths for the individual effects, or
  2. factor-by smooths for the individual effects

The smooths in Option 1 (random smooths) would all share the same wiggliness, while the smooths in Option 2 (factor-by smooths) would be able to have different wigglinesses if supported by the data.
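As a rough sketch of how the "shared effect plus individual smooths" variant might look under each of these two choices (simulated data; the data frame `dat` and the model names `m_fs` and `m_by` are hypothetical, and the `m = 1` penalty and the `bs = "re"` intercept term follow the formulations I recall from the HGAM paper rather than anything specific to your data):

```r
library(mgcv)

set.seed(2)
# Hypothetical simulated data: a shared trend plus user-specific deviations
users <- paste0("u", 1:6)
dat <- expand.grid(rep = 1:60, user_id = users)
dat$user_id <- factor(dat$user_id)
dat$speed <- runif(nrow(dat), 0, 100)
dev <- rnorm(length(users), sd = 0.02)   # user-specific slope deviations
dat$y <- sin(dat$speed / 15) + dev[as.integer(dat$user_id)] * dat$speed +
  rnorm(nrow(dat), sd = 0.3)

# Shared smooth + random ("fs") smooths: the user-level deviations
# share a common set of smoothing parameters
m_fs <- gam(y ~ s(speed) + s(speed, user_id, bs = "fs"),
            data = dat, method = "REML")

# Shared smooth + factor-by smooths: one smoothing parameter per user.
# m = 1 penalises departures from a flat deviation; the "re" term
# supplies the user-specific intercepts
m_by <- gam(y ~ s(speed) + s(speed, by = user_id, m = 1) +
              s(user_id, bs = "re"),
            data = dat, method = "REML")

# Number of estimated smoothing parameters in each model
length(m_fs$sp)
length(m_by$sp)
```

The factor-by version estimates at least one smoothing parameter per user, which is what lets the wiggliness differ between individuals.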

Since we wrote that paper, mgcv has gained other ways to specify the Option 2 smooths.

If you want to just estimate a separate smooth for each individual, then you can use:

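# user_id must be a factor; data2 holds all users
# (pass data by name: the second positional argument of gam() is family)
gam(y ~ user_id + s(speed, by = user_id) +
                  s(length, by = user_id) +
                  s(price, by = user_id),
    data = data2, method = "REML")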

whereas, if you have many individuals and expect the smooths to have different shapes but largely the same degree of wiggliness, you could fit

# user_id must be a factor; data2 holds all users
gam(y ~ s(speed, user_id, bs = "fs") +
        s(length, user_id, bs = "fs") +
        s(price, user_id, bs = "fs"),
    data = data2, method = "REML")
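
To get John's speed effect out of such a model, you predict as before but keep user_id in the new data as a factor with the same levels used in fitting. Below is a minimal sketch on simulated data (the `data1`/`data2` here are made-up stand-ins, and `model_by` and `prediction_by` are hypothetical names). I would expect the curve from the factor-by model to be close to the John-only fit, though not necessarily identical, since the pooled model estimates a single residual variance across all users.

```r
library(mgcv)

set.seed(3)
# Hypothetical simulated data standing in for data2 (all users)
users <- c("John", "Mary", "Alex")
n <- 300
data2 <- data.frame(
  user_id = factor(rep(users, each = n), levels = users),
  speed   = runif(3 * n, 0, 100),
  length  = runif(3 * n, 0, 10),
  price   = runif(3 * n, 0, 50)
)
amp <- c(1, -1, 2)[as.integer(data2$user_id)]   # user-specific speed effect
data2$y <- amp * sin(data2$speed / 20) + 0.2 * data2$length +
  0.01 * data2$price + rnorm(3 * n, sd = 0.3)
data1 <- droplevels(subset(data2, user_id == "John"))   # John only

# John-only model and the pooled factor-by model
model_1 <- gam(y ~ s(speed) + s(length) + s(price),
               data = data1, method = "REML")
model_by <- gam(y ~ user_id + s(speed, by = user_id) +
                  s(length, by = user_id) + s(price, by = user_id),
                data = data2, method = "REML")

# Speed grid for John, other covariates fixed at John's means
test_set <- data.frame(
  user_id = factor("John", levels = levels(data2$user_id)),
  speed   = seq(0, 100, by = 5),
  length  = mean(data1$length),
  price   = mean(data1$price)
)
prediction_1  <- predict(model_1,  newdata = test_set, type = "response")
prediction_by <- predict(model_by, newdata = test_set, type = "response")

# Largest absolute disagreement between the two curves
max(abs(prediction_1 - prediction_by))
```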