Longitudinal Data – Understanding Time Series, Repeated Measures, and Other Longitudinal Data Types

mixed-model, panel-data, regression, repeated-measures, time-series

In plain English:
I have a multiple regression or ANOVA model but the response variable for each individual is a curvilinear function of time.

  • How can I tell which of the right-hand-side variables are responsible for significant differences in the shapes or vertical offsets of the curves?
  • Is this a time-series problem, a repeated-measures problem, or something else entirely?
  • What are the best-practices for analyzing such data (preferably in R, but I'm open to using other software)?

In more precise terms:
Let's say I have a model $y_{ijk} = \beta_0 + \beta_1 x_i + \beta_2 x_j + \beta_3 x_i x_j + \epsilon_k$, but $y_{ijk}$ is actually a series of data points collected from the same individual $k$ at many time points $t$, which were recorded as a numeric variable. Plotting the data shows that for each individual, $y_{ijkt}$ is a quadratic or cyclical function of time whose vertical offset, shape, or frequency (in the cyclical case) might depend significantly on the covariates. The covariates do not change over time; i.e., an individual has a constant body weight or treatment group for the duration of the data-collection period.
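To make the setup concrete, here is a hypothetical simulation of data with this structure (all names — `mydata`, `ID`, `TIME`, `A`, `W`, `Y` — and the parameter values are invented for illustration): time-constant covariates, irregular time points, and a cyclical response per individual.

```r
# Sketch: simulate 20 individuals, each observed at 30 irregular time points,
# with a time-constant treatment A and body weight W, and a cyclical response.
set.seed(1)
mydata <- do.call(rbind, lapply(1:20, function(k) {
  t <- sort(runif(30, 0, 5))          # irregular, individual-specific times
  A <- sample(c("ctrl", "trt"), 1)    # time-constant covariate (recycled)
  w <- rnorm(1, 70, 10)               # time-constant body weight
  y <- 2 + 0.02 * w + (A == "trt") + sin(2 * pi * t) + rnorm(30, sd = 0.3)
  data.frame(ID = k, TIME = t, A = A, W = w, Y = y)
}))
```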

So far I have tried the following R approaches:

  1. MANOVA

    library(car)  # Anova() with idata/idesign comes from the car package
    fit <- lm(YT ~ A * B, data = mydata)
    Anova(fit, idata = data.frame(TIME = factor(1:10)), idesign = ~TIME)
    

    …where YT is a matrix whose columns are the time points, 10 of them in this example, but far more in the real data.

    Problem: this treats time as a factor, but the time points don't exactly match across individuals. Furthermore, there are many of them relative to the sample size, so the model becomes saturated. The shape of the response over time also seems to be ignored.

  2. Mixed model (as in Pinheiro and Bates, Mixed-Effects Models in S and S-PLUS)

    library(nlme)
    lme(fixed = Y ~ A*B*TIME + sin(2*pi*TIME) + cos(2*pi*TIME),
        data = mydata,
        random = ~ TIME + sin(2*pi*TIME) + cos(2*pi*TIME) | ID,
        method = 'ML')
    

    …where ID is a factor that groups data by individual. In this example the response is cyclical over time, but there could instead be quadratic terms or other functions of time.

    Problem: I'm not certain whether each time term is necessary (especially for quadratic terms) and which ones are affected by which covariates.

    • Is stepAIC() a good method for selecting them?
    • If it does remove a time-dependent term, will it also remove it from the random argument?
    • What if I also use an autocorrelation function (such as corExp()) that takes a formula in the correlation argument: should I make the formula for corExp() the same as the one in random, or just ~1|ID?
    • The nlme package is rarely mentioned in the context of time series outside Pinheiro and Bates… is it not considered well suited to this problem?
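On the term-selection question, one alternative to stepAIC() is to fit nested models with method = "ML" and compare them with a likelihood-ratio test via anova(); the sketch below assumes the variable names from the lme call above, and the corExp() line shows one plausible (not prescribed) choice of form.

```r
# Sketch: compare nested lme fits instead of stepAIC().
library(nlme)
full <- lme(Y ~ A*B*TIME + sin(2*pi*TIME) + cos(2*pi*TIME),
            data = mydata, random = ~ TIME | ID, method = "ML")
reduced <- lme(Y ~ A*B*TIME,                   # cyclical terms dropped
               data = mydata, random = ~ TIME | ID, method = "ML")
anova(reduced, full)   # likelihood-ratio test of the cyclical terms
                       # (comparing fixed effects requires method = "ML")

# Within-individual autocorrelation: corExp() takes form = ~ covariate | group,
# so ~ TIME | ID treats elapsed time as the "distance" between observations.
with_ac <- lme(Y ~ A*B*TIME + sin(2*pi*TIME) + cos(2*pi*TIME),
               data = mydata, random = ~ TIME | ID, method = "ML",
               correlation = corExp(form = ~ TIME | ID))
```

Note that the random argument is specified separately from the fixed formula, so dropping a fixed term does not automatically drop the corresponding random term; that choice is yours either way.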
  3. Fitting a quadratic or trigonometric model to each individual, and then using each coefficient as a response variable for multiple regression or ANOVA.

    Problem: multiple-comparison correction is necessary. I can't think of any other problems, which makes me suspicious that I'm overlooking something.
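This two-stage approach can be sketched with nlme::lmList, which fits one model per individual; the quadratic form, the covariate names, and the merge step below are assumptions for illustration.

```r
# Sketch: per-individual quadratic fits, then regress each coefficient
# on the time-constant covariates A and B.
library(nlme)
fits  <- lmList(Y ~ TIME + I(TIME^2) | ID, data = mydata)
coefs <- coef(fits)                     # one row per individual (ID in rownames)
coefs$A <- mydata$A[match(rownames(coefs), mydata$ID)]
coefs$B <- mydata$B[match(rownames(coefs), mydata$ID)]
summary(lm(`I(TIME^2)` ~ A * B, data = coefs))  # do covariates affect curvature?
```

A further issue with this approach, beyond multiple comparisons, is that the second-stage regression treats the estimated coefficients as known quantities, ignoring their estimation uncertainty — which is part of what a mixed model handles for you.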

  4. As previously suggested on this site (What is the term for a time series regression having more than one predictor?), there are ARIMAX and transfer function / dynamic regression models.

    Problem: ARMA-based models assume discrete, equally spaced times, don't they? As for dynamic regression, I heard of it for the first time today, but before I delve into yet another new method that might not pan out after all, I thought it prudent to ask people who have done this before for advice.

Best Answer

As Jeromy Anglim said, it would help to know the number of time points you have for each individual; since you said "many", I would venture that functional data analysis might be a viable alternative. You might want to check the R package fda and look at the book by Ramsay and Silverman, Functional Data Analysis.
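A minimal sketch of the fda starting point, assuming the observations have been arranged into a matrix `ymat` (rows = common time grid `argvals`, columns = individuals) — the basis size is an arbitrary choice here:

```r
# Sketch: represent each individual's trajectory as a smooth function.
library(fda)
basis <- create.bspline.basis(rangeval = range(argvals), nbasis = 15)
yfd   <- smooth.basis(argvals, ymat, basis)$fd
plot(yfd)   # one smoothed curve per individual
```

From there, tools such as functional regression relate the fitted curves (or features of them) to scalar covariates like treatment group.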
