Solved – Variable selection for time covariate

linear model, model selection, r, regression, time series

I'm fitting a linear model where the response is a function both of time and of static covariates (i.e. ones that are independent of time). The ultimate goal is to identify significant effects of the static covariates.

Is this the best general strategy for selecting variables (in R, using the nlme package)? Anything I can do better?

  1. Break the data up by groups and plot it against time. For continuous covariates, bin them and plot the data in each bin against time. Use the group-specific trends to make an initial guess at which time terms to include: time, time^n, sin(2*pi*time) + cos(2*pi*time), log(time), exp(time), etc. (step 1 sketch after this list).
  2. Add one term at a time, comparing each model to its predecessor, never adding a higher-order term in the absence of its lower-order terms; sin and cos are never added separately (step 2 sketch below). Is it acceptable to pass over a term that significantly improves the fit of the model if there is no physical interpretation of that term?
  3. With the full dataset, use forward selection to add static variables to the model, and then the relevant interaction terms with each other and with the time terms (step 3 sketch below). I've seen some strong criticism of stepwise regression, but doesn't forward selection ignore significant higher-order terms if the lower-order terms they depend on are not significant? And I've noticed that it's hard to pick a starting model for backward elimination that isn't saturated or singular, or that fails to converge. How do you decide between variable selection algorithms?
  4. Add random effects to the model (step 4 sketch below). Is this as simple as doing the variable selection using lm() and then putting the final formula into lme() and specifying the random effects, or should I include random effects from the very start? Compare the fits of models using a random intercept only, a random interaction with the linear time term, and a random interaction with each successive time term.
  5. Plot a semivariogram to see if an autoregressive error term is needed (step 5 sketch below). What should a semivariogram look like if the answer is 'no'? A horizontal line? How straight, how horizontal? Does including autoregression in the model again require checking the candidate variables and interactions to make sure they're still relevant?
  6. Plot the residuals to see if the variance changes as a function of fitted value, time, or any of the other terms. If it does, weight the variances appropriately (for lme(), use the weights argument to specify a varFunc()) and compare to the unweighted model to see if this improves the fit (step 6 sketch below). Is this the right point in the sequence for this step, or should it come before the autocorrelation check?
  7. Run summary() on the fitted model to identify significant coefficients for numeric covariates, and Anova() on the fitted model to identify significant effects for qualitative covariates (step 7 sketch below).
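
A minimal sketch of step 1. All names here and in the sketches below are hypothetical: a data frame dat with response y, time variable time, grouping factor id, continuous covariate x, and static covariates g1 and g2.

    library(lattice)

    ## group-specific trends against time
    xyplot(y ~ time | id, data = dat, type = c("p", "smooth"))

    ## bin a continuous covariate into quartiles and plot each bin against time
    dat$x_bin <- cut(dat$x,
                     breaks = quantile(dat$x, probs = seq(0, 1, 0.25)),
                     include.lowest = TRUE)
    xyplot(y ~ time | x_bin, data = dat, type = c("p", "smooth"))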
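
For step 2, a sketch of nested comparisons, adding one time term at a time:

    m1 <- lm(y ~ time, data = dat)
    m2 <- update(m1, . ~ . + I(time^2))                        # never time^2 without time
    m3 <- update(m2, . ~ . + sin(2*pi*time) + cos(2*pi*time))  # sin/cos enter together

    anova(m1, m2)  # F-test: does the quadratic term help?
    anova(m2, m3)  # F-test: does the seasonal pair help?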
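
For step 3, note that add1() (and hence step()) only offers an interaction as a candidate once all of its lower-order terms are already in the model, which addresses part of the marginality worry:

    base  <- lm(y ~ time + I(time^2), data = dat)
    scope <- ~ time + I(time^2) + g1 + g2 + g1:g2 + g1:time + g2:time

    add1(base, scope = scope, test = "F")                    # one forward step
    fwd <- step(base, scope = scope, direction = "forward")  # full forward pass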
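
For step 4, a sketch of carrying the fixed-effects formula into lme() and comparing random-effects structures (a valid REML comparison, since the fixed effects are identical):

    library(nlme)

    fixed   <- y ~ time + I(time^2) + g1   # formula arrived at with lm()
    m_int   <- lme(fixed, random = ~ 1 | id,    data = dat, method = "REML")
    m_slope <- lme(fixed, random = ~ time | id, data = dat, method = "REML")

    anova(m_int, m_slope)  # random intercept vs. random slope on time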
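
For step 5, nlme's Variogram() gives the semivariogram directly; for normalized residuals with no serial correlation it should scatter flatly around 1:

    plot(Variogram(m_slope, form = ~ time | id, resType = "normalized"))

    ## refit with AR(1) errors and compare; corCAR1() is the
    ## continuous-time version if time is not on an integer grid
    m_ar1 <- update(m_slope, correlation = corAR1(form = ~ time | id))
    anova(m_slope, m_ar1)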
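
For step 6, one common varFunc() choice lets the residual standard deviation grow as a power of the fitted values:

    m_wt <- update(m_ar1, weights = varPower())
    anova(m_ar1, m_wt)  # weighted vs. unweighted fit

    ## re-check the spread of the normalized residuals
    plot(m_wt, resid(., type = "normalized") ~ fitted(.))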
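
For step 7 (Anova() with a capital A is from the car package; nlme's own anova() gives sequential tests by default):

    summary(m_wt)  # t-tables for individual coefficients
    library(car)
    Anova(m_wt)    # Wald tests for whole terms, including multi-level factors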

Best Answer

Fully data-driven model selection will result in standard errors and P-values that are too small, confidence intervals that are too narrow, and overstated effects for the terms that remain in the model.

For time effects I usually use restricted cubic splines. A detailed case study in the context of generalized least squares for serially correlated data may be found at http://biostat.mc.vanderbilt.edu/RmS - see the two attachments at the bottom named course2.pdf and rms.pdf. The case study uses the R rms package and contains information about the choice of basis functions for the time component.
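
A minimal sketch of this approach, reusing the hypothetical variable names from the question (not the case study's actual code):

    library(nlme)
    library(rms)

    dd <- datadist(dat); options(datadist = "dd")

    ## restricted cubic spline in time (4 knots), crossed with a static
    ## covariate, with continuous-time AR(1) within-group correlation
    fit <- Gls(y ~ rcs(time, 4) * g1,
               correlation = corCAR1(form = ~ time | id),
               data = dat)
    anova(fit)  # joint Wald tests, including a separate test of nonlinearity in time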