Generalized Additive Model – Is My GAM Modeling Longitudinal Change Appropriately?

gamm4generalized linear modelgeneralized-additive-modelmgcvr

My data includes the following:

Y which is a physiological measure
Age continuous
Gender factor, Male or Female
Timepoint One for each year of study visit: 1,2,3,4
subject each unique subject

Model aim:

To identify longitudinal changes of Y across Timepoints, for each subject between Gender
- I am assuming a random slope and intercept is appropriate as each subject will have a different physiological change between Timepoints
I am also trying to use gratia::smooth_estimates to plot the smooth term for age between Gender

Model:

m1 <- gam(Y ~ subject + Gender + s(age, by = subject) + s(Age, by = Gender, bs = 'fs') + s(Timepoint, subject, bs ='fs') +  s(subject, bs = "re") , 
    data = DF,
    method = "REML")

However, this does not work,

Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : A term has fewer unique covariate combinations than specified maximum degrees of freedom

So I have to use:

m1.reduced <- gam(Y ~ Gender + s(Age, by = Gender, bs = 'fs') + s(subject, bs = "re"),
          method = "REML", 
          data = DF_y)

Or I can also use gamm4 (which works with the plot below, though I am not sure if it is appropriately modeling what I am intending:

m2 <- gamm4(Y ~ s(Age, by = Gender, bs = 'fs') + Gender+ Timepoint,
                     random = ~ (1 + Timepoint | subject),
                     data = DF)

Plot:

sm <- smooth_estimates(m1) %>% 
  add_confint()

ggplot(data = sm, aes(y = est, x = Age)) +
  geom_ribbon(aes(ymin = lower_ci, ymax = upper_ci, fill = Gender),
                alpha = 0.25) +
  geom_line(aes(color = Gender),  size = 1.5) +
  geom_point(data = DF, aes(Age, Y, color = Gender), alpha = .25) +
  geom_line(data = DF, aes(Age, Y, group = subject), alpha = 0.25) +
  ylab("Partial effect") +
  xlab("Age"))

I am trying to replicate a plot like this (source): where each color is a gender and the predicted longitudinal trajectory is plotted across age.

This is my model 2 (gamm4) plot that is providing what I want.

However, is my gam model modeling and plotting longitudinal change correctly?

sample data:

structure(list(subject= c(33714L, 35377L, 40556L, 40798L, 40800L, 
40815L, 50848L, 52183L, 52461L, 53320L, 53873L, 54206L, 54581L, 
55122L, 55267L, 55462L, 55612L, 55920L, 56022L, 56307L, 56420L, 
56679L, 57405L, 57480L, 57725L, 57809L, 58004L, 58215L, 58229L, 
59326L, 59327L, 59865L, 60099L, 60100L, 60280L, 60384L, 60429L, 
60493L, 60503L, 60603L, 60664L, 60846L, 61415L, 61749L, 61883L, 
62081L, 62983L, 63327L, 63329L, 64418L, 64507L, 64596L, 65178L, 
65250L, 65802L, 65975L, 65978L, 66396L, 66572L, 66589L, 74034L, 
74427L, 74607L, 74952L, 75732L, 76574L, 76595L, 76755L, 76759L, 
77203L, 77453L, 77668L, 81064L, 81065L, 33714L, 35377L, 40556L, 
40798L, 40800L, 40815L, 50848L, 52183L, 52461L, 53320L, 53873L, 
54206L, 54581L, 55122L, 55267L, 55462L, 55612L, 55920L, 56022L, 
56307L, 56420L, 56679L, 57405L, 57480L, 57725L, 57809L, 58004L, 
58215L, 58229L, 59326L, 59327L, 59865L, 60099L, 60100L, 60280L, 
60384L, 60429L, 60493L, 60503L, 60603L, 60664L, 60846L, 61415L, 
61749L, 61883L, 62081L, 62983L, 63327L, 63329L, 64418L, 64507L, 
64596L, 65178L, 65250L, 65802L, 65975L, 65978L, 66396L, 66572L, 
66589L, 74034L, 74427L, 74607L, 74952L, 75732L, 76574L, 76595L, 
76755L, 76759L, 77203L, 77453L, 77668L, 81064L, 81065L, 33714L, 
35377L, 40556L, 40798L, 40800L, 40815L, 50848L, 52183L, 52461L, 
53320L, 53873L, 54206L, 54581L, 55122L, 55267L, 55462L, 55612L, 
55920L, 56022L, 56307L, 56420L, 56679L, 57405L, 57480L, 57725L, 
57809L, 58004L, 58215L, 58229L, 59326L, 59327L, 59865L, 60099L, 
60100L, 60280L, 60384L, 60429L, 60493L, 60503L, 60603L, 60664L, 
60846L, 61415L, 61749L, 61883L, 62081L, 62983L, 63327L, 63329L, 
64418L, 64507L, 64596L, 65178L, 65250L, 65802L, 65975L, 65978L, 
66396L, 66572L, 66589L, 74034L, 74427L, 74607L, 74952L, 75732L, 
76574L, 76595L, 76755L, 76759L, 77203L, 77453L, 77668L, 81064L, 
81065L), Gender = structure(c(1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 
2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 
2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 
2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 
2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 
2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 
2L, 1L, 2L, 1L, 1L), .Label = c("Male", "Female"), class = "factor"), 
    Age = c(15L, 15L, 9L, 11L, 16L, 9L, 16L, 16L, 14L, 8L, 6L, 
    14L, 10L, 15L, 13L, 15L, 8L, 9L, 9L, 8L, 9L, 9L, 13L, 10L, 
    7L, 8L, 8L, 6L, 15L, 8L, 11L, 14L, 12L, 10L, 16L, 12L, 10L, 
    6L, 13L, 11L, 12L, 13L, 10L, 13L, 14L, 12L, 17L, 9L, 12L, 
    11L, 10L, 12L, 10L, 10L, 14L, 16L, 15L, 14L, 14L, 13L, 10L, 
    12L, 9L, 9L, 16L, 10L, 14L, 15L, 13L, 15L, 13L, 13L, 8L, 
    11L, 16L, 16L, 11L, 13L, 18L, 10L, 18L, 18L, 15L, 10L, 8L, 
    15L, 12L, 16L, 14L, 16L, 9L, 11L, 11L, 10L, 10L, 11L, 14L, 
    12L, 8L, 9L, 9L, 8L, 16L, 9L, 13L, 16L, 13L, 12L, 18L, 13L, 
    11L, 8L, 14L, 12L, 13L, 14L, 11L, 14L, 15L, 14L, 18L, 11L, 
    14L, 12L, 12L, 13L, 11L, 11L, 15L, 17L, 16L, 15L, 15L, 14L, 
    11L, 13L, 10L, 11L, 18L, 11L, 15L, 16L, 14L, 17L, 14L, 14L, 
    9L, 12L, 18L, 18L, 12L, 14L, 19L, 11L, 19L, 19L, 16L, 11L, 
    9L, 17L, 13L, 18L, 16L, 18L, 11L, 12L, 12L, 11L, 11L, 12L, 
    16L, 13L, 9L, 11L, 10L, 9L, 17L, 11L, 14L, 17L, 14L, 13L, 
    19L, 15L, 12L, 9L, 15L, 14L, 14L, 15L, 12L, 16L, 17L, 15L, 
    20L, 12L, 15L, 13L, 13L, 14L, 12L, 12L, 16L, 19L, 18L, 16L, 
    16L, 15L, 12L, 14L, 11L, 11L, 19L, 12L, 16L, 17L, 15L, 18L, 
    15L, 15L, 10L, 13L), Y= c(-3.91040964443843, -4.1357309415226, 
    -3.15711439697149, -5.3057953682613, -5.64125379300701, -5.14501797062188, 
    0.244858660867345, 0.0760011434854038, -1.38050716781916, 
    -0.765398953185992, -1.47720169386702, -1.64309892713909, 
    -0.396819316883117, -1.36975793945299, -1.80785572309015, 
    -0.401914887891913, -0.811404680008261, -2.4492910546021, 
    -0.620903118150862, -1.10304119407832, -0.683772758549858, 
    -1.47159656575735, -0.823876696667884, 0.229547683217108, 
    -1.91347963071532, -1.05594355918274, -2.37962731952471, 
    -1.34893889218848, -0.385827442278431, -0.980092345023242, 
    -0.181058281596406, -0.832975930612162, -1.9959308225624, 
    -0.12787022611412, -0.205371434695515, -0.609353157197594, 
    -0.477183751078979, -1.53245224237668, -0.137187841673061, 
    -1.11978378453579, -1.38506891710323, -0.665622819909006, 
    -0.858065551674517, -0.935493966384356, -0.444790478237349, 
    -0.0888284463372063, 0.793433276899957, -1.37118955226023, 
    -1.27752810419379, 0.571630350274245, 0.38391922015976, -0.585743678190174, 
    -0.639562613892594, 0.637702921021628, 1.02283103079116, 
    0.509051885361466, 0.256796855802238, -0.827467860997893, 
    1.25436407158129, 0.797121499725372, -1.46470541258355, 0.72639012119852, 
    -1.77912640845008, 1.31912635264069, 0.135328148802092, -0.522413009938002, 
    0.680529982119362, 0.068624697834574, -0.598312753345204, 
    0.141370040141094, -0.392378890718311, 1.88269650621348, 
    -3.0968410713246, -0.0622101539722123, -4.1452184094485, 
    -0.954638754603117, -0.106347504757922, -0.716821176235476, 
    0.33910245990621, -1.65370256776216, 0.666505029533247, 1.11178514183042, 
    0.0370564222038934, 0.0466166840013479, -0.557572449898689, 
    -0.961796818639283, 0.936133529770083, -0.283139554139275, 
    -0.111006312537392, 0.655416096433155, 0.253351279215336, 
    -1.27117077274472, 0.821677299062997, 1.30061244464207, -0.571112110007776, 
    -0.978927643078376, -0.0939968112175565, 1.01351341523222, 
    -0.482667556069396, 0.416191169882131, -1.13218300732386, 
    0.0109719515636322, 0.570902411558704, 0.251628490921888, 
    0.757376045856767, -0.286245425992256, -0.850640576775985, 
    0.323694423760568, 0.33667599752107, 0.489882832518856, 0.0192219236731108, 
    -0.267852841112888, 1.81167395220041, -0.0196014744891418, 
    -0.0975879755475657, 1.11193072957353, 1.43135023795345, 
    -0.0901872652728838, 0.562191411596049, 0.0591857591563773, 
    1.10166679368438, -0.140293713526042, 0.0486791770287176, 
    0.34512008662136, 0.818328780971503, -0.238419852381131, 
    -0.368114266866903, 1.03486628422146, 0.539091489689507, 
    -0.233227222876931, -0.351614322647946, -0.108579850152251, 
    1.42892377556831, 0.96726504217144, -0.436055213650844, 0.565831105173759, 
    -1.54960733143962, 1.100259445501, -0.614861226811863, -0.361659876922429, 
    1.40706134947819, -0.552573937385301, 0.105482661464863, 
    -0.374326010572864, -0.0292830594058542, 1.98545718822419, 
    -1.19221368673224, 0.760991474810628, -0.539568099000947, 
    -0.0739542319162944, 0.34373700306183, 0.053071073945822, 
    0.0606901658351647, -0.211219209043705, 0.216590374080456, 
    1.00007081361854, -0.495649129829898, 0.0867746364754286, 
    -1.67723925289802, -0.826836980777755, 1.18863120556783, 
    -0.795681203752549, -0.650869928607352, 0.333545861044237, 
    0.63103014946249, -1.75748236397463, -0.118601139802882, 
    1.51474775013074, 0.132537717059181, -0.677973513449372, 
    0.467947612557183, 1.56786301174147, -0.118528345931329, 
    0.554596584330558, -0.59224659738235, -0.543498968064875, 
    -0.380440695783417, 0.0643541240367275, -0.210830975062082, 
    0.503471021875644, -0.660672836643319, 1.50676468888363, 
    0.780839937121078, -0.116635705270918, -0.240385286913095, 
    0.00044110481212036, 1.4571920623552, -0.350401091455376, 
    0.0850033189342734, 0.197882349091022, 0.726511444317778, 
    -0.331741595713643, 0.837497833814114, 0.609264781867777, 
    0.532151807268005, 0.65446977610295, -0.581909867621652, 
    0.322942220421174, 1.52993740466172, -0.264455793773691, 
    -0.0568476721010507, 1.05299195823846, 0.915435805624833, 
    0.297609953120304, -0.114257772133482, 0.248013061968027, 
    1.64822744593733, 1.26800079018578, -0.0459528559917708, 
    0.720784993088846, -1.05679282101754, 1.38345187047077, -0.64414862780051, 
    -1.25411274217718, 0.177548594303543, -0.257734492966852, 
    -0.818028922319694, 0.409736779937656, -0.216411838547905, 
    1.32536236097051, -0.500938817829506, 0.803770006660658), 
    Timepoint= c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 2, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 2, 2, 2, 2, 3, 
    3, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 3, 3, 2, 3, 3, 2, 2, 
    3, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 2, 2, 2, 2, 
    2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2, 3, 2, 2, 2, 2, 4, 
    4, 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
    3, 4, 4, 4, 3, 4, 3, 4, 3, 4, 4, 4, 3, 4, 4, 4, 3, 4, 3, 
    4, 3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 4, 3, 3, 3, 3, 4, 4, 3, 
    3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 4, 3, 3, 3, 3)), row.names = c(NA, 
-222L), class = c("tbl_df", "tbl", "data.frame"))

Best Answer

The first model fails because you ask for too many coefficients; you ask for Subject specific intercepts twice (!) and try to fit 9 basis functions per subject in the s(Timepoint, Subject, bs = "fs") term.

The "fs" smooth of Timepoint by Subject already contains a random intercept per Subject: you don't need to add another separate random intercept via the s(Subject, bs = "re") term. You're just trying to estimate an extra nlevels(Subject) parameters that are confounded with the "fs" terms random intercept parameters, and nlevels(Subject) of these parameters are going to be shrunk to 0.

Whenever you include "fs" smooths, you typically don't need a separate random intercept for the grouping variable of the "fs" smooth.
Looking at your plot, in many case you <<9 observations per subject. Unless you have >>9 observations for other subjects you can't estimate 9 parameters per subject; you don't have the data to support such rich smooths.
The s(Age, by = Gender, bs = "fs") really isn't doing what you think it's doing. There's no grouping variable here (so this should arguably raise an error). This just needs to be s(Age, by = Gender).
You compound the problem from 1. by adding another nlevels(Subject) - 1 parameters for the subject term by estimated parametric effects. While we say you "have" to include the by variable as parametric effect, typically what we mean is you need to have something in the model for the group means of the factor by term(s). The random intercept in either the s(Timepoint, Subject, bs = "fs") smooth or the s(Subject, bs = "re") smooth would account for these group means of Subject; you don't need yet more terms.

Get rid of Subject from the parametric effects, drop the Subject random intercept, and think through how many parameters you can afford to expend on the subject-specific time curves (in the "fs" Timepoint` smooth).

Why do you estimate time curves by subjects as having the same wiggliness but age by subject curves are each allowed their own wiggliness? You now have to choose values for nlevels(Subject) extra smoothness parameters. It also doesn't make much philosophical sense to treat one as essentially random and the other fixed for the same grouping variable.

A much simpler model for the first case would be:

m1 <- gam(Y ~ Gender + s(Age, by = Gender) +
            s(Timepoint, Subject, bs = "fs", k = 10), 
          data = DF,
          method = "REML")

and shift k smaller if you can't fit even this simpler model.

If you want subject-specific Age effects, you could add that to the model as so:

m1a <- gam(Y ~ Gender + s(Age, by = Gender) +
             s(Timepoint, Subject, bs = "fs", k = 10) +
             s(Age, Subject, bs = "fs"), 
           data = DF,
           method = "REML")

but at this point we probably need to start thinking more carefully as you'd want to be sure you aren't estimating another nlevels(Subject) random intercepts and check what identifiability constraints are being applied here.

With the exception of considering an auto correlated process in the model (Frank's suggestion), a GAM such as the one I show above would be a reasonable starting point for modelling longitudinal data via a GAM.

You're never going to get a plot like the one you showed from the paper by using smooth_estimates(). This is evaluating the partial effect of each smooth, while the plot from the paper seems to be on the response scale (so it would include the intercept). You could achieve this, kind of, by just adding the correct parametric effects to the est column (and the upper and lower intervals), but you'd have to get the right parametric effects added to the right rows, and you would have too-narrow an interval because you aren't really including the uncertainties in those parametric effects into the std. error.

Beyond that, you can't just pass the object of smooth_estimates() to ggplot() and expect it to do anything useful. You returned values for all smooths in your model in the object sm, so any row with an Age value that isn't NA will end up getting plotted somehow.

Better would be to get predictions from your model for the scenarios you want. To do that you would be forced to write down what you actually want to achieve, not just tell us what you want it to look like that image (I'm not reading a paper's methods section just to intuit what you want to show and what the authors of that paper did to generate their plot).

You seem to want to plot estimates of Age by Gender at the population level. So you will want something like:

new_df <- with(df,
    expand.grid(Subject = factor(levels(Subject)[1], levels = levels(Subject)),
                Age = gratia::seq_min_max(Age, n = 100),
                Gender = factor(levels(Gender), levels = levels(Gender)),
                Timepoint = median(Timepoint)))

where you are explicitly conditioning on the Age and Gender values, and we'll exclude the random smooth effects due to Subject and Timepoint and Age from my m1a (assuming you can fit that) by use of the exclude argument to predict.gam().

Top get the predicted values use gratia::fitted_values() unless you want to do everything by hand yourself, in which case use predict(m1a, new data = new_df, exclude = ...., se.fit = TRUE) etc with exclude set as below:

fv <- fitted_values(m1a, data = new_df,
                    exclude = c("s(Timepoint,Subject)",
                                "s(Age,Subject)"))

where the things we pass to exclude are the labels of the smooths that relate to Subject level that we want to ignore as you seem to want population level effects. You can get these labels from gratia::smooths(m1a) if you don't know how {mgcv} creates labels for it's smooths.

At this point you might want to rethink your model structure a bit to include population (Pedersen et al 2019 PeerJ call these "global" smooths) effects for Timepoint as the only thing account for Timepoint is a random smooth term. Even so, you'd still have to condition the predictions on the specified value of Timepoint (here I use the median time point but you can plug in whatever value works for you).

Your gamm4() makes similar mistakes (the bs = "fs" is redundant on the s(Age, by = Gender, bs = "fs") smooth and why do you now model time as a linear parametric effect of Timepoint when in the gam you had subject-specific random smooths of Timepoint for these effects? If you wanted this in the gam() why not the gamm4()?

These models are so flexible and there are so many options, one needs to restrain oneself from trying everything in a model because you think you need it. Think about the model (effects) you want fit, write it down in words, bullet point form. Then write out a model formula in R's notation that you think covers each bullet point. Then post that in your question if you want to know if this is a suitable model for your data.

Related Solutions

Solved – Correlation between two variables measured on a “strongly agree” to “strongly disagree” scale

As you have ordinal factors, means are not so useful. You could use a $\chi^2$ test and/or Spearman correlation to find if the two values are correlated.

Commands:

chisq.test(analysis3$groups,analysis3$quickly) ,

and after converting your "quickly" strings to factors, reordering and extracting the levels to a numeric vector, you can apply Spearman correlation:

analysis3$qui_fact<- as.factor(analysis3$quickly)

levels(analysis$qui_fact) #(alphabetical levels)

analysis$qui_fact<- factor(analysis$qui_fact,levels(analysis$qui_fact)[c(4,1,3,2,5)]) #reorder as needed

analysis$qui_num<- as.numeric(analysis$qui_fact)

cor.test(analysis$groups,analysis$qui_num,alt="two.sided",method="spearman",conf.level=.99)

Time Series Modeling – GAM with Many Binary Factors

You're assuming that the audience here knows quite a lot about your particular subject area and specific task. I'm also going to assume your pseudocode was just a thinko as it is gam(y ~ s(x) ...) not y ~ gam(s(x) ...).

A first problem is that in any model with a by factor smooth you need to include the factor as a "fixed" effect to include the group means. So if one went for a by factor model you'd need something like

gam(glucose ~ bread + butter + breakfast +
      s(time) +
      s(time, by = bread) +
      s(time, by = butter) +
      s(time, by = breakfast),
    data = meals, method = "REML", family = Gamma(link = "log"))

where I would (instead of the standard factor by model, but note I don't think this is what you want - read on) code bread, butter, breakfast etc as ordered factors. That way we'll get the smooths set up as a smooth of time for the all zero combination (assuming you have those?) of the factor variables, and then the by factor smooths will be parameterised as deviations from this reference level "meal food stuff combo".

In other words your s(time) if you code the factors covariates as ordered factors will represent the reference way glucose varies over time post meal, and the other smooths will reflect smooth deviations from this reference time course.

I can think of other ways in which we might have a smoothly varying effect of the food stuffs. You could do the same model as above but with standard factors; though I don't think this will get you what you want as the smooth of each food stuff by time will be some smushed up average of all the meals in which that particular food stuff was consumed.

The problem with both of these models is that we will get two smooths per food stuff, one for the not-consumed (0) case and one for the consumed case (1), and that is likely not what you want.

Another model might be to treat these food stuff variables as continuous variables and fit

gam(glucose ~ s(time, by = bread) +
      s(time, by = butter) +
      s(time, by = breakfast),
    data = meals, method = "REML", family = Gamma(link = "log"))

How this model differs is that here we are fitting

$$\text{glucose} = \beta_0 + f(\mathtt{time}) \mathtt{bread} + \cdots$$

compared with something like this for the factor by model

$$\text{glucose} = \beta_0 + \gamma_{\text{bread}} + f_{\text{no bread}}(\mathtt{time}) + f_{\text{bread}}(\mathtt{time}) + \cdots$$

It's kind of cheating, but in the first equation above, note that we are multiplying the smooth of time by the numeric value $\mathtt{bread}$

$$f(\mathtt{time}) \mathtt{bread}$$

and if $\mathtt{bread} = 0$ then we get no contribution at all from the bread, whereas in the second equation (the one for the by factor model) we will get a smooth of time base on all the meals in which which bread was not consumed and that doesn't sound useful to me.

You could create all possible combinations of food stuffs observed in the data set, create a factor from that variable and use that to fit a whole bunch of smooths, sort of like this (in pseudocode)

meals <- transform(meals,
                   food_stuff = interaction(bread, butter, breakfast, 
                                            drop = TRUE))

gam(glucose ~ food_stuff + s(time, by = food_stuff), ....)

but that could end up creating a huge number of smooths (each using 9 parameters each by default) if you have even a modest number of food stuffs.

Whether the continuous version works will depend on whether you need some indication of how blood glucose changes over time post meal or not. For example, you might want:

$$\text{glucose} = \beta_0 + f(\mathtt{time}) + f(\mathtt{time}) \mathtt{bread} + \cdots$$

and set up the smooth for the by term to have a penalty on the first derivative so that it penalises deviations from the average glucose time course:

gam(glucose ~ s(time) + s(time, by = bread, m = 1) + ....)

where the m = 1 will set the penalty to be on the first derivative, (i.e. deviations from a constant function).

What you want will depend on how you understand glucose to vary over time with different combinations of food stuffs and that's not something we can help you with unless you actually spell this out in your question. Even then you're likely to want to seek local support as this seems like a relatively complicated modelling exercise requiring some detailed input from a statistician.

Best Answer

Related Solutions

Solved – Correlation between two variables measured on a “strongly agree” to “strongly disagree” scale

Time Series Modeling – GAM with Many Binary Factors

Related Question