Mgcv – why does gam fit change so much with random effect

mgcvr

This is related to an earlier question which Gavin Simpson kindly answered:
mgcv Error in gam. Model has more coefficients than data

For these data I have 94 subjects and 123 observations and want to fit an effect for group that can vary over time. The response is non-integer and positive so I am trying a quasipoisson distribution.

A lowess plot looks like:

Lowess

If I run the following, ignoring the repeated measures by person:

# Model
gam_mod_sii <- gam(ratio_SII ~ group + s(time, bs = "cr", by = group), method = "REML", family = quasipoisson, data = ratios_wide)

# Set grid for predicted data from model
pred_df <- data.frame()
pred_df <- with(ratios_wide, 
                expand.grid(time = seq(0, max_time, length = 100), 
                            group = levels(group),
                            id = ratios_wide$id[1]))
# Predictions
pred <- predict(gam_mod_sii, newdata = pred_df, se.fit = TRUE, exclude = "s(id)", type = "response")
pred_df <- cbind(pred_df, as.data.frame(pred))
pred_df <- pred_df |> 
  mutate(fitted_lower = fit - (1.96 * se.fit),
         fitted_upper = fit + (1.96 * se.fit))

I get the following plot – which seems reasonable and in line with a curved effect for one of the groups as reflected in the lowess plot.

GAM fit without random effect

gratia::appraise(gam_mod_sii)

GAM diagnostics without random effect

Now if I attempt to include a random effect for person:

# Model
gam_mod_sii <- gam(ratio_SII ~ group + s(time, bs = "cr", by = group) + s(id, bs = "re"), method = "REML", family = quasipoisson, data = ratios_wide)

# Set grid for predicted data from model
pred_df <- data.frame()
pred_df <- with(ratios_wide, 
                expand.grid(time = seq(0, max_time, length = 100), 
                            group = levels(group),
                            id = ratios_wide$id[1]))
# Predictions
pred <- predict(gam_mod_sii, newdata = pred_df, se.fit = TRUE, exclude = "s(id)", type = "response")
pred_df <- cbind(pred_df, as.data.frame(pred))
pred_df <- pred_df |> 
  mutate(fitted_lower = fit - (1.96 * se.fit),
         fitted_upper = fit + (1.96 * se.fit))

Gam fit with random effect

gratia::appraise(gam_mod_sii)

Gam diagnostics with random effect

The diagnostic plots look a lot better for the model with the random effect. However, the fits are essentially linear. I'm just wondering if I have specified the model correctly and should this be an expected finding?

Thanks.

Best Answer

You might be better off using the tw() family to fit a Tweedie model, at least then you'll have a proper likelihood and can use things like AIC etc to compare fits.

You're generating your confidence intervals incorrectly. That they include negative values should've been a big warning sign as those values simply aren't plausible values. You should use predict(...., type = "link", se.fit = TRUE), compute the confidence interval on the link scale, and then transform the fitted values and the upper and lower limits of the confidence interval back to the response scale using the inverse of the link function (here you would just use exp() on the fitted values and upper and lower interval values).

If you have 94 subjects and only 123 observations, that implies that for most of the subjects you have a single observation. By including the random effect you are soaking up (modelling) some of the variation in the response that is due to individuals (subjects) but because variation between subjects is pretty much all you have in your data set (you have very little data to inform the within-Subject part of the model) there's little left for the effect of time to model. Hence the flat fitted lines.

Try plotting your data faceted by Subject:

ggplot(ratios_wide, aes(x = time, y = ratio_SII)) +
  geom_point() +
  facet_wrap( ~ id)

and see if there is much in the way of change over time in those plots. That should help you understand why you get such different time smooths when you include or exclude the random effect.

Also, the name of your response variable ratio_SII implies you've divided your original response variable by another value. If you did, can you explain what the original data and the thing you divided it by are / represent? This is often a mistake I see people make where they start with a count response but need to normalise it by some other variable (to account for effort or some such) and so they divide the nice integer count data they had by this value and end up having to model a now continuous variable which leads them to doing things like quasipoisson models, when they could have just stuck with the original count data, fitted a Poisson model, and used a offset to account for the thing they need to normalise their data by...

Related Question