Solved – Variable selection for time covariate

linear model, model selection, r, regression, time series

I'm fitting a linear model where the response is a function both of time and of static covariates (i.e. ones that are independent of time). The ultimate goal is to identify significant effects of the static covariates.

Is this the best general strategy for selecting variables (in R, using the nlme package)? Anything I can do better?

  1. Break the data up by groups and plot it against time. For continuous covariates, bin them and plot the data in each bin against time. Use the group-specific trends to make an initial guess at which time terms to include: time, time^n, sin(2*pi*time) + cos(2*pi*time), log(time), exp(time), etc. (step 1 sketch after this list).
  2. Add one term at a time, comparing each model to its predecessor, never adding a higher-order term in the absence of its lower-order terms; sin and cos are never added separately (step 2 sketch below). Is it acceptable to pass over a term that significantly improves the fit of the model if there is no physical interpretation of that term?
  3. With the full dataset, use forward selection to add static variables to the model, and then the relevant interaction terms with each other and with the time terms (step 3 sketch below). I've seen some strong criticism of stepwise regression, but doesn't forward selection ignore significant higher-order terms if the lower-order terms they depend on are not significant? And I've noticed that it's hard to pick a starting model for backward elimination that isn't saturated or singular, or that fails to converge. How do you decide between variable selection algorithms?
  4. Add random effects to the model (step 4 sketch below). Is this as simple as doing the variable selection using lm() and then putting the final formula into lme() and specifying the random effects, or should I include random effects from the very start? Compare the fits of models using a random intercept only, a random interaction with the linear time term, and a random interaction with each successive time term.
  5. Plot a semivariogram to see if an autoregressive error term is needed (step 5 sketch below). What should a semivariogram look like if the answer is 'no'? A horizontal line? How straight, how horizontal? Does including autoregression in the model again require checking the candidate variables and interactions to make sure they're still relevant?
  6. Plot the residuals to see if the variance changes as a function of fitted value, time, or any of the other terms. If it does, weight the variances appropriately (for lme(), use the weights argument to specify a varFunc()) and compare to the unweighted model to see if this improves the fit (step 6 sketch below). Is this the right point in the sequence for this step, or should it come before the autocorrelation check?
  7. Run summary() on the fitted model to identify significant coefficients for numeric covariates, and Anova() on the fitted model to identify significant effects for qualitative covariates (step 7 sketch below).
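
A minimal sketch of step 1. All names here and in the sketches below are hypothetical: a data frame dat with response y, time variable time, grouping factor id, continuous covariate x, and static covariates g1 and g2.

    library(lattice)

    ## group-specific trends against time
    xyplot(y ~ time | id, data = dat, type = c("p", "smooth"))

    ## bin a continuous covariate into quartiles and plot each bin against time
    dat$x_bin <- cut(dat$x,
                     breaks = quantile(dat$x, probs = seq(0, 1, 0.25)),
                     include.lowest = TRUE)
    xyplot(y ~ time | x_bin, data = dat, type = c("p", "smooth"))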
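
For step 2, a sketch of nested comparisons, adding one time term at a time:

    m1 <- lm(y ~ time, data = dat)
    m2 <- update(m1, . ~ . + I(time^2))                        # never time^2 without time
    m3 <- update(m2, . ~ . + sin(2*pi*time) + cos(2*pi*time))  # sin/cos enter together

    anova(m1, m2)  # F-test: does the quadratic term help?
    anova(m2, m3)  # F-test: does the seasonal pair help?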
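
For step 3, note that add1() (and hence step()) only offers an interaction as a candidate once all of its lower-order terms are already in the model, which addresses part of the marginality worry:

    base  <- lm(y ~ time + I(time^2), data = dat)
    scope <- ~ time + I(time^2) + g1 + g2 + g1:g2 + g1:time + g2:time

    add1(base, scope = scope, test = "F")                    # one forward step
    fwd <- step(base, scope = scope, direction = "forward")  # full forward pass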
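
For step 4, a sketch of carrying the fixed-effects formula into lme() and comparing random-effects structures (a valid REML comparison, since the fixed effects are identical):

    library(nlme)

    fixed   <- y ~ time + I(time^2) + g1   # formula arrived at with lm()
    m_int   <- lme(fixed, random = ~ 1 | id,    data = dat, method = "REML")
    m_slope <- lme(fixed, random = ~ time | id, data = dat, method = "REML")

    anova(m_int, m_slope)  # random intercept vs. random slope on time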
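
For step 5, nlme's Variogram() gives the semivariogram directly; for normalized residuals with no serial correlation it should scatter flatly around 1:

    plot(Variogram(m_slope, form = ~ time | id, resType = "normalized"))

    ## refit with AR(1) errors and compare; corCAR1() is the
    ## continuous-time version if time is not on an integer grid
    m_ar1 <- update(m_slope, correlation = corAR1(form = ~ time | id))
    anova(m_slope, m_ar1)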
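
For step 6, one common varFunc() choice lets the residual standard deviation grow as a power of the fitted values:

    m_wt <- update(m_ar1, weights = varPower())
    anova(m_ar1, m_wt)  # weighted vs. unweighted fit

    ## re-check the spread of the normalized residuals
    plot(m_wt, resid(., type = "normalized") ~ fitted(.))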
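
For step 7 (Anova() with a capital A is from the car package; nlme's own anova() gives sequential tests by default):

    summary(m_wt)  # t-tables for individual coefficients
    library(car)
    Anova(m_wt)    # Wald tests for whole terms, including multi-level factors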

Best Answer

Fully data-driven model selection will result in standard errors and P-values that are too small, confidence intervals that are too narrow, and overstated effects for the terms that remain in the model.

For time effects I usually use restricted cubic splines. A detailed case study in the context of generalized least squares for serially correlated data may be found at http://biostat.mc.vanderbilt.edu/RmS - see the two attachments at the bottom named course2.pdf and rms.pdf. The case study uses the R rms package and contains information about the choice of basis functions for the time component.
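
A minimal sketch of this approach, reusing the hypothetical variable names from the question (not the case study's actual code):

    library(nlme)
    library(rms)

    dd <- datadist(dat); options(datadist = "dd")

    ## restricted cubic spline in time (4 knots), crossed with a static
    ## covariate, with continuous-time AR(1) within-group correlation
    fit <- Gls(y ~ rcs(time, 4) * g1,
               correlation = corCAR1(form = ~ time | id),
               data = dat)
    anova(fit)  # joint Wald tests, including a separate test of nonlinearity in time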