Mgcv Error in gam. Model has more coefficients than data

mgcvr

I'm completely new to GAM's so please bear with me. We have two groups that have an outcome measured over time. The trends are clearly non-linear and while measurements of the outcome differ at the start of time, they gradually converge to be quite similar. We are interested in trying to model and test the latest time that differences between the two groups are still statistically significant.

My plan was to run a gam model and then test group means at regular time points using the emmeans package.

If I run the following model, everything works great:

gam_mod <- gam(lactate ~ group + s(time, by = group), data = dat_long)

I can get predictions, do plots and test differences.

However, while most observations are independent there are some individuals with multiple observations – in total: 53 subjects with 64 observations. So, it may not make that much difference if I ignore the repeated measures given there aren't many.

In any case if I run a model to try and account for the repeats, I get an error:

> gam_mod <- gam(lactate ~ group + s(time, by = group) + s(id, bs = "re"), data = dat_long)
Error in gam(lactate ~ group + s(time, by = group) + s(id, bs = "re"),  : 
  Model has more coefficients than data

If I want to specify a random intercept, is this model specification correct? If so, is the error just because I don't have enough data (how many parameters are being estimated?), or are there issues because of the larger numbers of singleton clusters?

Thanks

Best Answer

Your model requires the estimation of 52 + (9*2) + 1 + 1 = 72 parameters:

  • 53-1 for the random effect (+ s(id, bs = 're')),
  • 9 each for the by group smooths of time ( + s(time, by = group)),
  • 1 for the difference between reference group and the other group (+ group), and
  • 1 for the model constant term which equal the mean response for the reference group

which can't be done with only 64 observations (and without restricting the parameters in some way - which gam can't do).

There's basically nothing you can do here (with gam()) if you want to include a random intercept (at least not without doing something very hacky with unknown [to me at least] side effects). I would just ignore the repeated observations for some individuals as they're not that large a component of the data.

You might try to fit this using gamm4() from the gamm4 package, which uses the lme4 package under the hood to fit the GAMM:

gam_mod <- gamm4(lactate ~ group + s(time, by = group), data = dat_long,
                 random = ~ (1 | id))

I don't know if gamm4 can even fit this, but worth a try. It won't treat the subject specific intercepts as requiring a parameter per subject; basically it should see this as requiring just the estimation of one additional parameter, a variance for the distribution of the random effects. As such the model should require only (9*2) + 1 + 1 + 1 parameters.

Finally, lactate implies a response that is strictly positive or censored below some detection limit, with a non-constant mean-variance relationship. Such response data aren't well handled by the Gaussian distribution. You might want to consider an alternative distribution for the response (such as Gamma if the response isn't censored).

Related Question