Time Series Modeling – GAM with Many Binary Factors

gamm4generalized-additive-modelmgcvmodelingtime series

I am trying to model the blood glucose response at 15-minute intervals after a meal. The meal event has many continuous variables (like time of day, body weight, etc) and many binary variables (like butter, bread, breakfast, etc).

I have tried several approaches. One is to use all of the factors as by-variables, like this:

glucose ~ gam(s(time, by=butter) + s(time, by=bread) + s(time, by=breakfast) + s(weight) + ...)

Another option is to multiply all of the binary variables by time, so that they become smooths of their own:

glucose ~ gam(s(butter*time) + s(bread*time) + s(breakfast*time) + s(weight)

Are either of these on the right track? Is there a better way to handle all these binary factors without losing the time dependency? (E.g. the effect of butter should be smooth between 15 minutes and 30 minutes).

Another option that I am considering is decomposing the glucose signal into ~3 principal components and then training seperate GAMs for these. Then the binary factors could be treated as simple linear predictors.

Thank you for your help!

Edited to include more details as requested.

Best Answer

You're assuming that the audience here knows quite a lot about your particular subject area and specific task. I'm also going to assume your pseudocode was just a thinko as it is gam(y ~ s(x) ...) not y ~ gam(s(x) ...).

A first problem is that in any model with a by factor smooth you need to include the factor as a "fixed" effect to include the group means. So if one went for a by factor model you'd need something like

gam(glucose ~ bread + butter + breakfast +
      s(time) +
      s(time, by = bread) +
      s(time, by = butter) +
      s(time, by = breakfast),
    data = meals, method = "REML", family = Gamma(link = "log"))

where I would (instead of the standard factor by model, but note I don't think this is what you want - read on) code bread, butter, breakfast etc as ordered factors. That way we'll get the smooths set up as a smooth of time for the all zero combination (assuming you have those?) of the factor variables, and then the by factor smooths will be parameterised as deviations from this reference level "meal food stuff combo".

In other words your s(time) if you code the factors covariates as ordered factors will represent the reference way glucose varies over time post meal, and the other smooths will reflect smooth deviations from this reference time course.

I can think of other ways in which we might have a smoothly varying effect of the food stuffs. You could do the same model as above but with standard factors; though I don't think this will get you what you want as the smooth of each food stuff by time will be some smushed up average of all the meals in which that particular food stuff was consumed.

The problem with both of these models is that we will get two smooths per food stuff, one for the not-consumed (0) case and one for the consumed case (1), and that is likely not what you want.

Another model might be to treat these food stuff variables as continuous variables and fit

gam(glucose ~ s(time, by = bread) +
      s(time, by = butter) +
      s(time, by = breakfast),
    data = meals, method = "REML", family = Gamma(link = "log"))

How this model differs is that here we are fitting

$$\text{glucose} = \beta_0 + f(\mathtt{time}) \mathtt{bread} + \cdots$$

compared with something like this for the factor by model

$$\text{glucose} = \beta_0 + \gamma_{\text{bread}} + f_{\text{no bread}}(\mathtt{time}) + f_{\text{bread}}(\mathtt{time}) + \cdots$$

It's kind of cheating, but in the first equation above, note that we are multiplying the smooth of time by the numeric value $\mathtt{bread}$

$$f(\mathtt{time}) \mathtt{bread}$$

and if $\mathtt{bread} = 0$ then we get no contribution at all from the bread, whereas in the second equation (the one for the by factor model) we will get a smooth of time base on all the meals in which which bread was not consumed and that doesn't sound useful to me.

You could create all possible combinations of food stuffs observed in the data set, create a factor from that variable and use that to fit a whole bunch of smooths, sort of like this (in pseudocode)

meals <- transform(meals,
                   food_stuff = interaction(bread, butter, breakfast, 
                                            drop = TRUE))

gam(glucose ~ food_stuff + s(time, by = food_stuff), ....)

but that could end up creating a huge number of smooths (each using 9 parameters each by default) if you have even a modest number of food stuffs.

Whether the continuous version works will depend on whether you need some indication of how blood glucose changes over time post meal or not. For example, you might want:

$$\text{glucose} = \beta_0 + f(\mathtt{time}) + f(\mathtt{time}) \mathtt{bread} + \cdots$$

and set up the smooth for the by term to have a penalty on the first derivative so that it penalises deviations from the average glucose time course:

gam(glucose ~ s(time) + s(time, by = bread, m = 1) + ....)

where the m = 1 will set the penalty to be on the first derivative, (i.e. deviations from a constant function).

What you want will depend on how you understand glucose to vary over time with different combinations of food stuffs and that's not something we can help you with unless you actually spell this out in your question. Even then you're likely to want to seek local support as this seems like a relatively complicated modelling exercise requiring some detailed input from a statistician.

Related Question