I did a stepwise regression analysis to predict energy expenditure using the variables height, weight, age, gender, and energy intake. The final model contains the variables gender and weight. Does this final model take into account the gender by weight interaction? Or do I have to construct a new equation from this final model that will accommodate the interaction?
Solved – Interaction in stepwise regression analysis
interaction, stepwise regression
Related Solutions
Neither VIFs nor stepwise regression tells you what is dependent on what. For that, you want condition indices. In R you can get these from the perturb package using the colldiag function.
First, look for condition indices that are high (some suggest > 10, others > 30). Then, for those indices, look at the variables that contribute a large proportion of variance.
EDIT to clarify (from colldiag documentation)
library(perturb)
data(consumption)
# lagged consumption: NA for the first year, then cons shifted by one
ct1 <- with(consumption, c(NA, cons[-length(cons)]))
m1 <- lm(cons ~ ct1 + dpi + rate + d_dpi, data = consumption)
cd <- colldiag(m1)
cd
Gives
Condition
Index    Variance Decomposition Proportions
         intercept ct1   dpi   rate  d_dpi
1   1.000   0.001 0.000 0.000 0.000 0.002
2   4.143   0.004 0.000 0.000 0.001 0.136
3   7.799   0.310 0.000 0.000 0.013 0.001
4  39.406   0.263 0.005 0.005 0.984 0.048
5 375.614   0.421 0.995 0.995 0.001 0.814
Printing with a fuzz threshold suppresses proportions below the cutoff, which makes the problem spots easier to see: print(cd, fuzz = .3) gives
Condition
Index    Variance Decomposition Proportions
         intercept ct1   dpi   rate  d_dpi
1   1.000   .     .     .     .     .
2   4.143   .     .     .     .     .
3   7.799   0.310 .     .     .     .
4  39.406   .     .     .     0.984 .
5 375.614   0.421 0.995 0.995 .     0.814
The first column is just an identifier. The second is the condition index. The remaining columns are the variance decomposition proportions.
The bottom line shows clearly problematic collinearity (375 is >> 30). So, which variables are contributing? ct1, dpi, and d_dpi all have high variance decomposition proportions; all three are contributing. You need to do something about this.
The 4th line has a problematic condition index (39), but only one variable (rate) contributes much of its variance, so there is not much to do.
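As a concrete illustration of what colldiag reports, the condition indices are the ratios of the largest singular value of the column-scaled model matrix to each of the others. A minimal sketch with synthetic data (the variable names and values here are made up for the example, not from the consumption data):

```r
# Condition indices from the SVD of the unit-length-scaled model matrix,
# using synthetic data with a deliberately near-collinear pair of predictors.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # nearly identical to x1
y  <- 1 + x1 + rnorm(n)

X  <- model.matrix(lm(y ~ x1 + x2))
# Scale each column to unit length, as collinearity diagnostics do
Xs <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))
d  <- svd(Xs)$d
cond_index <- max(d) / d

round(sort(cond_index), 1)
# The largest condition index is far above 30, flagging the x1/x2 pair.
```

The smallest index is always 1 by construction; it is the large ones, together with the variance proportions, that identify which variables are entangled.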
The simple answer is no: subsampling will not help. If by subsampling you mean drawing a balanced sample so that the ratio of events to non-events changes from 200/1000 to 200/400, this is only used in classification models and is (generally) of no use in maximum-likelihood / probability models.
What the comments are trying to suggest is that there are many other larger issues revealed in questions that could be textbook chapters by themselves:
- The sample size of a logistic model is measured by the number of events, and model-building capacity by events per variable (EPV). With 8 EPV (assuming all predictors are continuous; otherwise effectively fewer), your sample size relative to your number of predictors is small, which is going to cause issues.
- Detecting interactions is notoriously difficult.
- Forward/backward and combination variable selection methods have major issues. This is a popular topic on Cross Validated. The problem is likely overemphasized, with limited issues when EPV > 50, but yours is the classic situation where you are going to be misled by automated variable selection methods (Austin: "Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality").
- Variable selection is hard.
- Prediction models and descriptive models often require different methods (not always, and the differences are often overemphasized), but from the limited information available it seems these two goals are going to be difficult to combine in this case.
- When evaluating logistic models, avoid "percent correctly classified": classification in logistic regression is often based on an arbitrary probability cut-off. A fair review is Steyerberg, "Assessing the performance...": http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3575184/
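To ground the last bullet, here is a minimal sketch (synthetic data, base R only) of evaluating a logistic model with the Brier score and the c-statistic (AUC) rather than percent correctly classified; neither depends on an arbitrary probability cut-off:

```r
# Evaluate a logistic model with proper measures instead of classification
# accuracy; the data-generating model below is invented for the example.
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x))

fit <- glm(y ~ x, family = binomial)
p   <- fitted(fit)

# Brier score: mean squared difference between predicted probability and outcome
brier <- mean((p - y)^2)

# c-statistic: probability that a random event gets a higher predicted
# probability than a random non-event (ties are negligible with continuous p)
cstat <- mean(outer(p[y == 1], p[y == 0], ">"))

c(brier = round(brier, 3), cstat = round(cstat, 3))
```

In practice one would also want calibration checks and out-of-sample (or bootstrap-corrected) versions of these measures, as the Steyerberg review discusses.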
Best Answer
Stepwise regression has a number of bad qualities (just look, e.g., here or here), and this is one of them. With only 5 variables (plus one interaction), unless your sample size is small you can include all the IVs. Why would you do this?
1) It is clear, even to a non-expert, that all your potential IVs are included for strong theoretical reasons. Finding that one of them was not related to energy expenditure would be remarkable. In fact, it would be more remarkable than finding that all of them were related.
2) Even if a variable is not significant and even if its coefficient is small, it might affect the relationship between the DV and other IVs.
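To make this concrete: with hypothetical variable names and simulated values (the original data are not shown), the full model with the gender-by-weight interaction can be fit directly; in an R formula, gender * weight expands to both main effects plus the gender:weight interaction.

```r
# Sketch: fit all IVs plus the gender-by-weight interaction in one model,
# instead of stepwise selection. All names and values here are invented.
set.seed(3)
n <- 200
d <- data.frame(
  height = rnorm(n, 170, 10),
  weight = rnorm(n, 70, 12),
  age    = rnorm(n, 40, 12),
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  intake = rnorm(n, 2200, 300)
)
# True model: main effects of weight and gender plus their interaction
d$expenditure <- 500 + 20 * d$weight + 150 * (d$gender == "M") +
  5 * d$weight * (d$gender == "M") + rnorm(n, sd = 100)

# gender * weight = gender + weight + gender:weight
m <- lm(expenditure ~ height + age + intake + gender * weight, data = d)
summary(m)$coefficients["genderM:weight", ]
```

The genderM:weight row of the coefficient table is the interaction: the difference in the weight slope between the two gender groups. A stepwise-selected model containing only gender and weight says nothing about this term; it has to be in the model to be estimated.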