I did a stepwise regression analysis to predict energy expenditure using the variables height, weight, age, gender, and energy intake. The final model contains the variables gender and weight. Does this final model take into account the gender by weight interaction? Or do I have to construct a new equation from this final model that will accommodate the interaction?
Solved – Interaction in stepwise regression analysis
interaction, stepwise regression
Related Solutions
Neither VIFs nor stepwise regression tells you what is dependent on what. For that, you want condition indices. In R you can get these from the perturb package using the colldiag function.
First, look for condition indices that are high (some suggest > 10, others > 30). Then, for those indices, look at the variables that contribute a large proportion of variance.
EDIT to clarify (from colldiag documentation)
library(perturb)
data(consumption)
# lagged consumption: NA for the first year, then cons shifted by one
ct1 <- with(consumption, c(NA, cons[-length(cons)]))
m1 <- lm(cons ~ ct1 + dpi + rate + d_dpi, data = consumption)
cd <- colldiag(m1)
cd
Gives
Condition
Index    Variance Decomposition Proportions
         intercept ct1   dpi   rate  d_dpi
1   1.000   0.001 0.000 0.000 0.000 0.002
2   4.143   0.004 0.000 0.000 0.001 0.136
3   7.799   0.310 0.000 0.000 0.013 0.001
4  39.406   0.263 0.005 0.005 0.984 0.048
5 375.614   0.421 0.995 0.995 0.001 0.814
Printing with a fuzz threshold suppresses proportions below the cutoff, which makes the problem spots easier to see: print(cd, fuzz = .3) gives
Condition
Index    Variance Decomposition Proportions
         intercept ct1   dpi   rate  d_dpi
1   1.000   .     .     .     .     .
2   4.143   .     .     .     .     .
3   7.799   0.310 .     .     .     .
4  39.406   .     .     .     0.984 .
5 375.614   0.421 0.995 0.995 .     0.814
The first column is just an identifier. The second is the condition index. The remaining columns are the variance decomposition proportions.
The bottom line shows clearly problematic collinearity (375 is >> 30). So, which variables are contributing? ct1, dpi, and d_dpi all have high variance decomposition proportions; all three are contributing. You need to do something about this.
The 4th line has a problematic condition index (39), but only one variable (rate) contributes much of its variance, so there is not much to do.
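As a concrete illustration of what colldiag reports, the condition indices are the ratios of the largest singular value of the column-scaled model matrix to each of the others. A minimal sketch with synthetic data (the variable names and values here are made up for the example, not from the consumption data):

```r
# Condition indices from the SVD of the unit-length-scaled model matrix,
# using synthetic data with a deliberately near-collinear pair of predictors.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # nearly identical to x1
y  <- 1 + x1 + rnorm(n)

X  <- model.matrix(lm(y ~ x1 + x2))
# Scale each column to unit length, as collinearity diagnostics do
Xs <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))
d  <- svd(Xs)$d
cond_index <- max(d) / d

round(sort(cond_index), 1)
# The largest condition index is far above 30, flagging the x1/x2 pair.
```

The smallest index is always 1 by construction; it is the large ones, together with the variance proportions, that identify which variables are entangled.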
The simple answer is no: subsampling will not help. If by subsampling you mean drawing a balanced sample so that the ratio of events to non-events changes from 200/1000 to 200/400, this is only used in classification models and is (generally) of no use in maximum-likelihood / probability models.
What the comments are trying to suggest is that there are many other larger issues revealed in questions that could be textbook chapters by themselves:
- The sample size of a logistic model is measured by the number of events, and model-building capacity by events per variable (EPV). With 8 EPV (assuming all predictors are continuous; otherwise effectively fewer), your sample size relative to your number of predictors is small, which is going to cause issues.
- Detecting interactions is notoriously difficult.
- Forward/backward and combination variable selection methods have major issues. This is a popular topic on Cross Validated. The problem is likely overemphasized, with limited issues when EPV > 50, but yours is the classic situation where you are going to be misled by automated variable selection methods (Austin: "Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality").
- Variable selection is hard.
- Prediction models and descriptive models often require different methods (not always, and the differences are often overemphasized), but from the limited information available it seems these two goals are going to be difficult to combine in this case.
- When evaluating logistic models, avoid "percent correctly classified": classification in logistic regression is often based on an arbitrary probability cut-off. A fair review is Steyerberg, "Assessing the performance...": http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3575184/
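To ground the last bullet, here is a minimal sketch (synthetic data, base R only) of evaluating a logistic model with the Brier score and the c-statistic (AUC) rather than percent correctly classified; neither depends on an arbitrary probability cut-off:

```r
# Evaluate a logistic model with proper measures instead of classification
# accuracy; the data-generating model below is invented for the example.
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + x))

fit <- glm(y ~ x, family = binomial)
p   <- fitted(fit)

# Brier score: mean squared difference between predicted probability and outcome
brier <- mean((p - y)^2)

# c-statistic: probability that a random event gets a higher predicted
# probability than a random non-event (ties are negligible with continuous p)
cstat <- mean(outer(p[y == 1], p[y == 0], ">"))

c(brier = round(brier, 3), cstat = round(cstat, 3))
```

In practice one would also want calibration checks and out-of-sample (or bootstrap-corrected) versions of these measures, as the Steyerberg review discusses.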
Best Answer
Stepwise regression has a number of bad qualities (just look, e.g., here or here), and this is one of them. With only 5 variables (plus one interaction), unless your sample size is small you can include all the IVs. Why would you do this?
1) It is clear, even to a non-expert, that all your potential IVs are included for strong theoretical reasons. Finding that one of them was not related to energy expenditure would be remarkable. In fact, it would be more remarkable than finding that all of them were related.
2) Even if a variable is not significant and even if its coefficient is small, it might affect the relationship between the DV and other IVs.
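To make this concrete: with hypothetical variable names and simulated values (the original data are not shown), the full model with the gender-by-weight interaction can be fit directly; in an R formula, gender * weight expands to both main effects plus the gender:weight interaction.

```r
# Sketch: fit all IVs plus the gender-by-weight interaction in one model,
# instead of stepwise selection. All names and values here are invented.
set.seed(3)
n <- 200
d <- data.frame(
  height = rnorm(n, 170, 10),
  weight = rnorm(n, 70, 12),
  age    = rnorm(n, 40, 12),
  gender = factor(sample(c("F", "M"), n, replace = TRUE)),
  intake = rnorm(n, 2200, 300)
)
# True model: main effects of weight and gender plus their interaction
d$expenditure <- 500 + 20 * d$weight + 150 * (d$gender == "M") +
  5 * d$weight * (d$gender == "M") + rnorm(n, sd = 100)

# gender * weight = gender + weight + gender:weight
m <- lm(expenditure ~ height + age + intake + gender * weight, data = d)
summary(m)$coefficients["genderM:weight", ]
```

The genderM:weight row of the coefficient table is the interaction: the difference in the weight slope between the two gender groups. A stepwise-selected model containing only gender and weight says nothing about this term; it has to be in the model to be estimated.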