Generalized Additive Model – Is My GAM Modeling Longitudinal Change Appropriately?

gamm4generalized linear modelgeneralized-additive-modelmgcvr

My data includes the following:

  • Y which is a physiological measure
  • Age continuous
  • Gender factor, Male or Female
  • Timepoint One for each year of study visit: 1,2,3,4
  • subject each unique subject

Model aim:

  • To identify longitudinal changes of Y across Timepoints, for each subject between Gender
    • I am assuming a random slope and intercept is appropriate as each subject will have a different physiological change between Timepoints
  • I am also trying to use gratia::smooth_estimates to plot the smooth term for age between Gender

Model:

m1 <- gam(Y ~ subject + Gender + s(age, by = subject) + s(Age, by = Gender, bs = 'fs') + s(Timepoint, subject, bs ='fs') +  s(subject, bs = "re") , 
    data = DF,
    method = "REML")

However, this does not work,

Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : A term has fewer unique covariate combinations than specified maximum degrees of freedom

So I have to use:

m1.reduced <- gam(Y ~ Gender + s(Age, by = Gender, bs = 'fs') + s(subject, bs = "re"),
          method = "REML", 
          data = DF_y)

Or I can also use gamm4 (which works with the plot below, though I am not sure if it is appropriately modeling what I am intending:

m2 <- gamm4(Y ~ s(Age, by = Gender, bs = 'fs') + Gender+ Timepoint,
                     random = ~ (1 + Timepoint | subject),
                     data = DF)

Plot:

sm <- smooth_estimates(m1) %>% 
  add_confint()

ggplot(data = sm, aes(y = est, x = Age)) +
  geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci, fill = Gender),
                alpha = 0.25) +
  geom_line(aes(color = Gender),  size = 1.5) +
  geom_point(data = DF, aes(Age, Y, color = Gender), alpha = .25) +
  geom_line(data = DF, aes(Age, Y, group = subject), alpha = 0.25) +
  ylab("Partial effect") +
  xlab("Age"))
  

I am trying to replicate a plot like this (source): where each color is a gender and the predicted longitudinal trajectory is plotted across age.

plot

This is my model 2 (gamm4) plot that is providing what I want.

However, is my gam model modeling and plotting longitudinal change correctly?
m2

sample data:

structure(list(subject= c(33714L, 35377L, 40556L, 40798L, 40800L, 
40815L, 50848L, 52183L, 52461L, 53320L, 53873L, 54206L, 54581L, 
55122L, 55267L, 55462L, 55612L, 55920L, 56022L, 56307L, 56420L, 
56679L, 57405L, 57480L, 57725L, 57809L, 58004L, 58215L, 58229L, 
59326L, 59327L, 59865L, 60099L, 60100L, 60280L, 60384L, 60429L, 
60493L, 60503L, 60603L, 60664L, 60846L, 61415L, 61749L, 61883L, 
62081L, 62983L, 63327L, 63329L, 64418L, 64507L, 64596L, 65178L, 
65250L, 65802L, 65975L, 65978L, 66396L, 66572L, 66589L, 74034L, 
74427L, 74607L, 74952L, 75732L, 76574L, 76595L, 76755L, 76759L, 
77203L, 77453L, 77668L, 81064L, 81065L, 33714L, 35377L, 40556L, 
40798L, 40800L, 40815L, 50848L, 52183L, 52461L, 53320L, 53873L, 
54206L, 54581L, 55122L, 55267L, 55462L, 55612L, 55920L, 56022L, 
56307L, 56420L, 56679L, 57405L, 57480L, 57725L, 57809L, 58004L, 
58215L, 58229L, 59326L, 59327L, 59865L, 60099L, 60100L, 60280L, 
60384L, 60429L, 60493L, 60503L, 60603L, 60664L, 60846L, 61415L, 
61749L, 61883L, 62081L, 62983L, 63327L, 63329L, 64418L, 64507L, 
64596L, 65178L, 65250L, 65802L, 65975L, 65978L, 66396L, 66572L, 
66589L, 74034L, 74427L, 74607L, 74952L, 75732L, 76574L, 76595L, 
76755L, 76759L, 77203L, 77453L, 77668L, 81064L, 81065L, 33714L, 
35377L, 40556L, 40798L, 40800L, 40815L, 50848L, 52183L, 52461L, 
53320L, 53873L, 54206L, 54581L, 55122L, 55267L, 55462L, 55612L, 
55920L, 56022L, 56307L, 56420L, 56679L, 57405L, 57480L, 57725L, 
57809L, 58004L, 58215L, 58229L, 59326L, 59327L, 59865L, 60099L, 
60100L, 60280L, 60384L, 60429L, 60493L, 60503L, 60603L, 60664L, 
60846L, 61415L, 61749L, 61883L, 62081L, 62983L, 63327L, 63329L, 
64418L, 64507L, 64596L, 65178L, 65250L, 65802L, 65975L, 65978L, 
66396L, 66572L, 66589L, 74034L, 74427L, 74607L, 74952L, 75732L, 
76574L, 76595L, 76755L, 76759L, 77203L, 77453L, 77668L, 81064L, 
81065L), Gender = structure(c(1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 
2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 
2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 
2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 
2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 
2L, 1L, 2L, 1L, 1L), .Label = c("Male", "Female"), class = "factor"), 
    Age = c(15L, 15L, 9L, 11L, 16L, 9L, 16L, 16L, 14L, 8L, 6L, 
    14L, 10L, 15L, 13L, 15L, 8L, 9L, 9L, 8L, 9L, 9L, 13L, 10L, 
    7L, 8L, 8L, 6L, 15L, 8L, 11L, 14L, 12L, 10L, 16L, 12L, 10L, 
    6L, 13L, 11L, 12L, 13L, 10L, 13L, 14L, 12L, 17L, 9L, 12L, 
    11L, 10L, 12L, 10L, 10L, 14L, 16L, 15L, 14L, 14L, 13L, 10L, 
    12L, 9L, 9L, 16L, 10L, 14L, 15L, 13L, 15L, 13L, 13L, 8L, 
    11L, 16L, 16L, 11L, 13L, 18L, 10L, 18L, 18L, 15L, 10L, 8L, 
    15L, 12L, 16L, 14L, 16L, 9L, 11L, 11L, 10L, 10L, 11L, 14L, 
    12L, 8L, 9L, 9L, 8L, 16L, 9L, 13L, 16L, 13L, 12L, 18L, 13L, 
    11L, 8L, 14L, 12L, 13L, 14L, 11L, 14L, 15L, 14L, 18L, 11L, 
    14L, 12L, 12L, 13L, 11L, 11L, 15L, 17L, 16L, 15L, 15L, 14L, 
    11L, 13L, 10L, 11L, 18L, 11L, 15L, 16L, 14L, 17L, 14L, 14L, 
    9L, 12L, 18L, 18L, 12L, 14L, 19L, 11L, 19L, 19L, 16L, 11L, 
    9L, 17L, 13L, 18L, 16L, 18L, 11L, 12L, 12L, 11L, 11L, 12L, 
    16L, 13L, 9L, 11L, 10L, 9L, 17L, 11L, 14L, 17L, 14L, 13L, 
    19L, 15L, 12L, 9L, 15L, 14L, 14L, 15L, 12L, 16L, 17L, 15L, 
    20L, 12L, 15L, 13L, 13L, 14L, 12L, 12L, 16L, 19L, 18L, 16L, 
    16L, 15L, 12L, 14L, 11L, 11L, 19L, 12L, 16L, 17L, 15L, 18L, 
    15L, 15L, 10L, 13L), Y= c(-3.91040964443843, -4.1357309415226, 
    -3.15711439697149, -5.3057953682613, -5.64125379300701, -5.14501797062188, 
    0.244858660867345, 0.0760011434854038, -1.38050716781916, 
    -0.765398953185992, -1.47720169386702, -1.64309892713909, 
    -0.396819316883117, -1.36975793945299, -1.80785572309015, 
    -0.401914887891913, -0.811404680008261, -2.4492910546021, 
    -0.620903118150862, -1.10304119407832, -0.683772758549858, 
    -1.47159656575735, -0.823876696667884, 0.229547683217108, 
    -1.91347963071532, -1.05594355918274, -2.37962731952471, 
    -1.34893889218848, -0.385827442278431, -0.980092345023242, 
    -0.181058281596406, -0.832975930612162, -1.9959308225624, 
    -0.12787022611412, -0.205371434695515, -0.609353157197594, 
    -0.477183751078979, -1.53245224237668, -0.137187841673061, 
    -1.11978378453579, -1.38506891710323, -0.665622819909006, 
    -0.858065551674517, -0.935493966384356, -0.444790478237349, 
    -0.0888284463372063, 0.793433276899957, -1.37118955226023, 
    -1.27752810419379, 0.571630350274245, 0.38391922015976, -0.585743678190174, 
    -0.639562613892594, 0.637702921021628, 1.02283103079116, 
    0.509051885361466, 0.256796855802238, -0.827467860997893, 
    1.25436407158129, 0.797121499725372, -1.46470541258355, 0.72639012119852, 
    -1.77912640845008, 1.31912635264069, 0.135328148802092, -0.522413009938002, 
    0.680529982119362, 0.068624697834574, -0.598312753345204, 
    0.141370040141094, -0.392378890718311, 1.88269650621348, 
    -3.0968410713246, -0.0622101539722123, -4.1452184094485, 
    -0.954638754603117, -0.106347504757922, -0.716821176235476, 
    0.33910245990621, -1.65370256776216, 0.666505029533247, 1.11178514183042, 
    0.0370564222038934, 0.0466166840013479, -0.557572449898689, 
    -0.961796818639283, 0.936133529770083, -0.283139554139275, 
    -0.111006312537392, 0.655416096433155, 0.253351279215336, 
    -1.27117077274472, 0.821677299062997, 1.30061244464207, -0.571112110007776, 
    -0.978927643078376, -0.0939968112175565, 1.01351341523222, 
    -0.482667556069396, 0.416191169882131, -1.13218300732386, 
    0.0109719515636322, 0.570902411558704, 0.251628490921888, 
    0.757376045856767, -0.286245425992256, -0.850640576775985, 
    0.323694423760568, 0.33667599752107, 0.489882832518856, 0.0192219236731108, 
    -0.267852841112888, 1.81167395220041, -0.0196014744891418, 
    -0.0975879755475657, 1.11193072957353, 1.43135023795345, 
    -0.0901872652728838, 0.562191411596049, 0.0591857591563773, 
    1.10166679368438, -0.140293713526042, 0.0486791770287176, 
    0.34512008662136, 0.818328780971503, -0.238419852381131, 
    -0.368114266866903, 1.03486628422146, 0.539091489689507, 
    -0.233227222876931, -0.351614322647946, -0.108579850152251, 
    1.42892377556831, 0.96726504217144, -0.436055213650844, 0.565831105173759, 
    -1.54960733143962, 1.100259445501, -0.614861226811863, -0.361659876922429, 
    1.40706134947819, -0.552573937385301, 0.105482661464863, 
    -0.374326010572864, -0.0292830594058542, 1.98545718822419, 
    -1.19221368673224, 0.760991474810628, -0.539568099000947, 
    -0.0739542319162944, 0.34373700306183, 0.053071073945822, 
    0.0606901658351647, -0.211219209043705, 0.216590374080456, 
    1.00007081361854, -0.495649129829898, 0.0867746364754286, 
    -1.67723925289802, -0.826836980777755, 1.18863120556783, 
    -0.795681203752549, -0.650869928607352, 0.333545861044237, 
    0.63103014946249, -1.75748236397463, -0.118601139802882, 
    1.51474775013074, 0.132537717059181, -0.677973513449372, 
    0.467947612557183, 1.56786301174147, -0.118528345931329, 
    0.554596584330558, -0.59224659738235, -0.543498968064875, 
    -0.380440695783417, 0.0643541240367275, -0.210830975062082, 
    0.503471021875644, -0.660672836643319, 1.50676468888363, 
    0.780839937121078, -0.116635705270918, -0.240385286913095, 
    0.00044110481212036, 1.4571920623552, -0.350401091455376, 
    0.0850033189342734, 0.197882349091022, 0.726511444317778, 
    -0.331741595713643, 0.837497833814114, 0.609264781867777, 
    0.532151807268005, 0.65446977610295, -0.581909867621652, 
    0.322942220421174, 1.52993740466172, -0.264455793773691, 
    -0.0568476721010507, 1.05299195823846, 0.915435805624833, 
    0.297609953120304, -0.114257772133482, 0.248013061968027, 
    1.64822744593733, 1.26800079018578, -0.0459528559917708, 
    0.720784993088846, -1.05679282101754, 1.38345187047077, -0.64414862780051, 
    -1.25411274217718, 0.177548594303543, -0.257734492966852, 
    -0.818028922319694, 0.409736779937656, -0.216411838547905, 
    1.32536236097051, -0.500938817829506, 0.803770006660658), 
    Timepoint= c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 2, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 2, 2, 2, 2, 3, 
    3, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 3, 3, 2, 3, 3, 2, 2, 
    3, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 2, 2, 2, 2, 
    2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2, 3, 2, 2, 2, 2, 4, 
    4, 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
    3, 4, 4, 4, 3, 4, 3, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 4, 3, 
    4, 3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 4, 3, 3, 3, 3, 4, 4, 3, 
    3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 4, 3, 3, 3, 3)), row.names = c(NA, 
-222L), class = c("tbl_df", "tbl", "data.frame"))

Best Answer

The first model fails because you ask for too many coefficients; you ask for Subject specific intercepts twice (!) and try to fit 9 basis functions per subject in the s(Timepoint, Subject, bs = "fs") term.

  1. The "fs" smooth of Timepoint by Subject already contains a random intercept per Subject: you don't need to add another separate random intercept via the s(Subject, bs = "re") term. You're just trying to estimate an extra nlevels(Subject) parameters that are confounded with the "fs" terms random intercept parameters, and nlevels(Subject) of these parameters are going to be shrunk to 0.

    Whenever you include "fs" smooths, you typically don't need a separate random intercept for the grouping variable of the "fs" smooth.

  2. Looking at your plot, in many case you <<9 observations per subject. Unless you have >>9 observations for other subjects you can't estimate 9 parameters per subject; you don't have the data to support such rich smooths.

  3. The s(Age, by = Gender, bs = "fs") really isn't doing what you think it's doing. There's no grouping variable here (so this should arguably raise an error). This just needs to be s(Age, by = Gender).

  4. You compound the problem from 1. by adding another nlevels(Subject) - 1 parameters for the subject term by estimated parametric effects. While we say you "have" to include the by variable as parametric effect, typically what we mean is you need to have something in the model for the group means of the factor by term(s). The random intercept in either the s(Timepoint, Subject, bs = "fs") smooth or the s(Subject, bs = "re") smooth would account for these group means of Subject; you don't need yet more terms.

Get rid of Subject from the parametric effects, drop the Subject random intercept, and think through how many parameters you can afford to expend on the subject-specific time curves (in the "fs" Timepoint` smooth).

Why do you estimate time curves by subjects as having the same wiggliness but age by subject curves are each allowed their own wiggliness? You now have to choose values for nlevels(Subject) extra smoothness parameters. It also doesn't make much philosophical sense to treat one as essentially random and the other fixed for the same grouping variable.

A much simpler model for the first case would be:

m1 <- gam(Y ~ Gender + s(Age, by = Gender) +
            s(Timepoint, Subject, bs = "fs", k = 10), 
          data = DF,
          method = "REML")

and shift k smaller if you can't fit even this simpler model.

If you want subject-specific Age effects, you could add that to the model as so:

m1a <- gam(Y ~ Gender + s(Age, by = Gender) +
             s(Timepoint, Subject, bs = "fs", k = 10) +
             s(Age, Subject, bs = "fs"), 
           data = DF,
           method = "REML")

but at this point we probably need to start thinking more carefully as you'd want to be sure you aren't estimating another nlevels(Subject) random intercepts and check what identifiability constraints are being applied here.

With the exception of considering an auto correlated process in the model (Frank's suggestion), a GAM such as the one I show above would be a reasonable starting point for modelling longitudinal data via a GAM.

You're never going to get a plot like the one you showed from the paper by using smooth_estimates(). This is evaluating the partial effect of each smooth, while the plot from the paper seems to be on the response scale (so it would include the intercept). You could achieve this, kind of, by just adding the correct parametric effects to the est column (and the upper and lower intervals), but you'd have to get the right parametric effects added to the right rows, and you would have too-narrow an interval because you aren't really including the uncertainties in those parametric effects into the std. error.

Beyond that, you can't just pass the object of smooth_estimates() to ggplot() and expect it to do anything useful. You returned values for all smooths in your model in the object sm, so any row with an Age value that isn't NA will end up getting plotted somehow.

Better would be to get predictions from your model for the scenarios you want. To do that you would be forced to write down what you actually want to achieve, not just tell us what you want it to look like that image (I'm not reading a paper's methods section just to intuit what you want to show and what the authors of that paper did to generate their plot).

You seem to want to plot estimates of Age by Gender at the population level. So you will want something like:

new_df <- with(df,
    expand.grid(Subject = factor(levels(Subject)[1], levels = levels(Subject)),
                Age = gratia::seq_min_max(Age, n = 100),
                Gender = factor(levels(Gender), levels = levels(Gender)),
                Timepoint = median(Timepoint)))

where you are explicitly conditioning on the Age and Gender values, and we'll exclude the random smooth effects due to Subject and Timepoint and Age from my m1a (assuming you can fit that) by use of the exclude argument to predict.gam().

Top get the predicted values use gratia::fitted_values() unless you want to do everything by hand yourself, in which case use predict(m1a, new data = new_df, exclude = ...., se.fit = TRUE) etc with exclude set as below:

fv <- fitted_values(m1a, data = new_df,
                    exclude = c("s(Timepoint,Subject)",
                                "s(Age,Subject)"))

where the things we pass to exclude are the labels of the smooths that relate to Subject level that we want to ignore as you seem to want population level effects. You can get these labels from gratia::smooths(m1a) if you don't know how {mgcv} creates labels for it's smooths.

At this point you might want to rethink your model structure a bit to include population (Pedersen et al 2019 PeerJ call these "global" smooths) effects for Timepoint as the only thing account for Timepoint is a random smooth term. Even so, you'd still have to condition the predictions on the specified value of Timepoint (here I use the median time point but you can plug in whatever value works for you).

Your gamm4() makes similar mistakes (the bs = "fs" is redundant on the s(Age, by = Gender, bs = "fs") smooth and why do you now model time as a linear parametric effect of Timepoint when in the gam you had subject-specific random smooths of Timepoint for these effects? If you wanted this in the gam() why not the gamm4()?

These models are so flexible and there are so many options, one needs to restrain oneself from trying everything in a model because you think you need it. Think about the model (effects) you want fit, write it down in words, bullet point form. Then write out a model formula in R's notation that you think covers each bullet point. Then post that in your question if you want to know if this is a suitable model for your data.

Related Question