Mgcv Error in gam. Model has more coefficients than data

mgcv, r

I'm completely new to GAMs, so please bear with me. We have two groups with an outcome measured over time. The trends are clearly non-linear, and while the outcome differs between the groups at the start, the groups gradually converge to be quite similar. We want to model the data and test for the latest time at which the difference between the two groups is still statistically significant.

My plan was to run a gam model and then test group means at regular time points using the emmeans package.

If I run the following model, everything works great:

gam_mod <- gam(lactate ~ group + s(time, by = group), data = dat_long)

I can get predictions, do plots and test differences.

However, while most observations are independent, some individuals have multiple observations: 53 subjects with 64 observations in total. Given how few repeats there are, ignoring the repeated measures may not make much difference.

In any case if I run a model to try and account for the repeats, I get an error:

> gam_mod <- gam(lactate ~ group + s(time, by = group) + s(id, bs = "re"), data = dat_long)
Error in gam(lactate ~ group + s(time, by = group) + s(id, bs = "re"),  : 
  Model has more coefficients than data

If I want to specify a random intercept, is this model specification correct? If so, is the error just because I don't have enough data (how many parameters are being estimated?), or are there issues because of the large number of singleton clusters?

Thanks

Best Answer

Your model requires the estimation of 52 + (9*2) + 1 + 1 = 72 parameters:

53-1 for the random effect (+ s(id, bs = 're')),
9 each for the by group smooths of time ( + s(time, by = group)),
1 for the difference between reference group and the other group (+ group), and
1 for the model constant term, which equals the mean response for the reference group

which can't be done with only 64 observations (and without restricting the parameters in some way - which gam can't do).
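The tally above can be checked with a few lines of arithmetic. This is only a sketch that reproduces the count as given in this answer; the variable names are made up for illustration, and the 9 coefficients per smooth assume the default basis dimension of s() (k = 10) less one for the identifiability constraint.

```r
n_obs      <- 64   # observations in dat_long
n_subjects <- 53   # levels of id

n_random <- n_subjects - 1  # random-effect term, as counted in the answer
n_smooth <- 9 * 2           # two by-group smooths of time
n_group  <- 1               # difference between the two groups
n_const  <- 1               # model constant term

n_coef <- n_random + n_smooth + n_group + n_const
n_coef          # 72
n_coef > n_obs  # TRUE: more coefficients than data
```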

There's basically nothing you can do here (with gam()) if you want to include a random intercept (at least not without doing something very hacky with unknown [to me at least] side effects). I would just ignore the repeated observations for some individuals as they're not that large a component of the data.

You might try to fit this using gamm4() from the gamm4 package, which uses the lme4 package under the hood to fit the GAMM:

gam_mod <- gamm4(lactate ~ group + s(time, by = group), data = dat_long,
                 random = ~ (1 | id))

I don't know if gamm4 can even fit this, but worth a try. It won't treat the subject specific intercepts as requiring a parameter per subject; basically it should see this as requiring just the estimation of one additional parameter, a variance for the distribution of the random effects. As such the model should require only (9*2) + 1 + 1 + 1 parameters.
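One practical point if you go this route: gamm4() returns a list rather than a single model object, with the GAM part in $gam and the lme4 fit in $mer. The sketch below uses simulated data (dat_sim is a hypothetical stand-in, deliberately larger than your data so that it should fit without trouble), so treat it as an illustration of the workflow and not as evidence that your own model will fit.

```r
library(gamm4)  # depends on mgcv and lme4

set.seed(1)
n_id  <- 40  # hypothetical number of subjects
n_rep <- 5   # hypothetical observations per subject

dat_sim <- data.frame(
  id   = factor(rep(seq_len(n_id), each = n_rep)),
  time = runif(n_id * n_rep, 0, 24)
)
dat_sim$group <- factor(ifelse(as.integer(dat_sim$id) <= n_id / 2, "a", "b"))

# Subject-specific intercepts plus a group difference that fades over time
subj <- rnorm(n_id, sd = 0.5)[as.integer(dat_sim$id)]
dat_sim$lactate <- 3 + 2 * (dat_sim$group == "b") * exp(-dat_sim$time / 6) +
  subj + rnorm(nrow(dat_sim), sd = 0.3)

fit <- gamm4(lactate ~ group + s(time, by = group),
             random = ~ (1 | id), data = dat_sim)

summary(fit$gam)        # smooths and parametric terms
lme4::VarCorr(fit$mer)  # variance of the random intercepts
```

Predictions and plots of the smooths would then be made from fit$gam.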

Finally, lactate implies a response that is strictly positive or censored below some detection limit, with a non-constant mean-variance relationship. Such response data aren't well handled by the Gaussian distribution. You might want to consider an alternative distribution for the response (such as Gamma if the response isn't censored).
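As a rough sketch of what that could look like with gam(), here is the model without the random effect, refitted with a Gamma family and log link. The data are simulated (dat_sim is a hypothetical stand-in for dat_long), and the log link and REML smoothness selection are my choices for the illustration, not something your data necessarily call for.

```r
library(mgcv)

set.seed(42)
n <- 64
dat_sim <- data.frame(
  group = factor(rep(c("a", "b"), each = n / 2)),
  time  = runif(n, 0, 24)
)

# Strictly positive response whose variance grows with the mean
mu <- exp(1 + 0.8 * (dat_sim$group == "b") * exp(-dat_sim$time / 6))
dat_sim$lactate <- rgamma(n, shape = 5, rate = 5 / mu)

gam_gamma <- gam(lactate ~ group + s(time, by = group),
                 family = Gamma(link = "log"),
                 data = dat_sim, method = "REML")
summary(gam_gamma)
```

With a log link the fitted values are always positive, which a Gaussian model with an identity link can't guarantee.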

Related Solutions

Mgcv – why does gam fit change so much with random effect

You might be better off using the tw() family to fit a Tweedie model, at least then you'll have a proper likelihood and can use things like AIC etc to compare fits.

You're generating your confidence intervals incorrectly. That they include negative values should've been a big warning sign as those values simply aren't plausible values. You should use predict(...., type = "link", se.fit = TRUE), compute the confidence interval on the link scale, and then transform the fitted values and the upper and lower limits of the confidence interval back to the response scale using the inverse of the link function (here you would just use exp() on the fitted values and upper and lower interval values).
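A minimal sketch of that recipe, assuming a log-link model fitted with tw(). The data and the names dat_sim, m, and newd are invented for the example, and I've used an approximate interval of plus or minus 2 standard errors.

```r
library(mgcv)

set.seed(123)
dat_sim <- data.frame(time = runif(150, 0, 10))
mu <- exp(0.5 + sin(dat_sim$time))
dat_sim$y <- rgamma(150, shape = 2, rate = 2 / mu)  # positive response

m <- gam(y ~ s(time), family = tw(), data = dat_sim, method = "REML")

newd <- data.frame(time = seq(0, 10, length.out = 100))
p <- predict(m, newdata = newd, type = "link", se.fit = TRUE)

# Build the interval on the link scale, then back-transform
ilink <- family(m)$linkinv  # inverse link; exp() for the log link
newd$fit   <- ilink(p$fit)
newd$lower <- ilink(p$fit - 2 * p$se.fit)
newd$upper <- ilink(p$fit + 2 * p$se.fit)

range(newd$lower)  # the lower limit can no longer go negative
```

Taking the inverse link from the fitted model, rather than hard-coding exp(), keeps the code correct if you later change the link.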

If you have 94 subjects and only 123 observations, that implies that for most of the subjects you have a single observation. By including the random effect you are soaking up (modelling) some of the variation in the response that is due to individuals (subjects) but because variation between subjects is pretty much all you have in your data set (you have very little data to inform the within-Subject part of the model) there's little left for the effect of time to model. Hence the flat fitted lines.

Try plotting your data faceted by Subject:

ggplot(ratios_wide, aes(x = time, y = ratio_SII)) +
  geom_point() +
  facet_wrap( ~ id)

and see if there is much in the way of change over time in those plots. That should help you understand why you get such different time smooths when you include or exclude the random effect.

Also, the name of your response variable ratio_SII implies you've divided your original response variable by another value. If you did, can you explain what the original data and the thing you divided it by are / represent? This is a mistake I often see people make: they start with a count response but need to normalise it by some other variable (to account for effort or some such), so they divide the nice integer count data they had by this value and end up having to model a now continuous variable, which leads them to doing things like quasipoisson models, when they could have just stuck with the original count data, fitted a Poisson model, and used an offset to account for the thing they need to normalise their data by...
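To make the offset idea concrete, here is a sketch with simulated counts. The variables count and effort are hypothetical, since I don't know what your ratio was built from.

```r
library(mgcv)

set.seed(7)
n <- 200
dat_sim <- data.frame(
  time   = runif(n, 0, 10),
  effort = runif(n, 1, 5)  # hypothetical normalising variable
)

# Counts scale with effort; the rate per unit effort varies with time
rate <- exp(0.2 + 0.5 * sin(dat_sim$time))
dat_sim$count <- rpois(n, lambda = rate * dat_sim$effort)

# Model the raw counts; log(effort) enters with its coefficient fixed at 1
m_off <- gam(count ~ s(time) + offset(log(effort)),
             family = poisson, data = dat_sim, method = "REML")

# Predicted rate per unit effort: set effort = 1 so the offset is log(1) = 0
newd <- data.frame(time = seq(0, 10, length.out = 50), effort = 1)
rate_hat <- predict(m_off, newdata = newd, type = "response")
```

The response stays an integer count, so the Poisson likelihood is appropriate, while the smooth of time still describes the normalised quantity.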

GAM partial effect line does not overlap raw data

It's because you forgot about the spatial component; you didn't tell visreg what covariate values to use for the spatial locations so it will have gone with a default which might be the mean of each spatial coordinate. If they happen to be in a place with higher counts, that would shift the plot up. The second model doesn't have a spatial component so there's nothing to condition on in this plot so the fit does go through the data as you would expect.

The reason plot.gam is producing what you want is that it shows a partial effect, which you then shift by adding on the intercept. This basically ignores the spatial component in the first model (its contribution is set to 0), which is also what you get if you exclude the spatial term from the model.

You can proceed to do what you want but visreg style without conditioning the predictions on particular values of the x and y coordinates with:

newdf <- with(cellDF,
              data.frame(sppRichness = seq(min(sppRichness),
                                           max(sppRichness),
                                           length = 100),
                         # have to provide something for x and y
                         x = 1, y = 1))
p <- predict(gamFit, newdata = newdf,
             exclude = "s(x,y)", # exclude the spatial effect from yhat
             type = "link", se.fit = TRUE)
newdf <- cbind(newdf, as.data.frame(p))
newdf <- transform(newdf,
                   upper = fit + (2*se.fit),
                   lower = fit - (2*se.fit))

As this is a Gaussian model you don't need to back-transform to the response scale. If it weren't, you'd need to apply the inverse of the link function to each of fit, lower, and upper before plotting.

The exclude argument to predict.gam allows you to generate predictions ignoring the listed term(s). Here we are doing what visreg is doing, but we don't need to condition on a spatial coordinate. This might not go as close to the middle of the data as you might like, but that'll be because some of the response magnitude is removed from the smooth of sppRichness and is accounted for by the spatial term.
