Solved – Confused about multicollinearity, variable selection and interaction terms

Tags: interaction, model selection, multicollinearity, r, regression

I have run a few tests/methods on my data and am getting contradictory results.

I have a linear model saying:
reg1 = lm(weight ~ height + age + gender + several other variables), where gender is categorical.

If I model each term linearly, i.e. no squared or interaction terms, and run vif(reg1), 4 variables are >15. If I delete the variable with the highest VIF and re-run it, the VIFs change and now only 2 variables are >15. I repeat this until I'm left with 20 variables (out of 30), all below 10. If I use stepwise selection directly on reg1, it does not delete the highest-VIF variable. I don't understand how this tells me which variable is linearly dependent on which, and how (and I cannot seem to find this information despite googling for ages).
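For what it's worth, the drop-the-highest-VIF loop described above can be written down explicitly. This is a base-R sketch (no car package needed); `X` is a placeholder for your matrix of predictors, and the cutoff of 10 is the one used above:

```r
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors.
vif_manual <- function(X) {
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - r2)
  })
}

# Repeatedly drop the predictor with the highest VIF until all
# remaining VIFs fall below the cutoff.
drop_high_vif <- function(X, cutoff = 10) {
  while (ncol(X) > 2) {
    v <- vif_manual(X)
    if (max(v) < cutoff) break
    X <- X[, -which.max(v), drop = FALSE]
  }
  X
}
```

Note that this greedy procedure depends on the order of deletion, which is one reason it can disagree with stepwise selection: the two methods optimize different things (collinearity vs. fit).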

Furthermore, when I look at the residual plots, most appear horizontal except a few that are shaped like an inverted U (none of these have high VIFs). Does this mean a transformation is needed? (I removed outliers, leverage points, etc., but now there seem to be more!)

reg2 = lm(weight ~ (height + age + gender + several other variables)^2), where gender is categorical.

If I run vif on this, all of the terms are >500!

What else I have tried (without cutting any variables):
(1) The errors seem correlated when I run diagnostics and check the Durbin–Watson statistic, indicating the model is not linear… however…
(2) Box–Cox gives lambda = 1, so no transformation is needed.
(3) LASSO gives the lowest Mallows' Cp on the full 30-variable model (i.e. least squares).
(4) Ridge regression gives lambda = 0, which did surprise me.
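Items (1) and (2) above can be reproduced in a few lines of base R. The sketch below uses toy data as a stand-in for the real data (the data frame `dat` and the model are placeholders; MASS ships with the base R distribution):

```r
library(MASS)  # for boxcox(); included with standard R installations

# (1) Durbin-Watson statistic computed directly from the residuals:
# DW = sum(diff(e)^2) / sum(e^2); values near 2 suggest uncorrelated errors.
durbin_watson <- function(fit) {
  e <- residuals(fit)
  sum(diff(e)^2) / sum(e^2)
}

set.seed(42)                          # toy stand-in for the real data
dat <- data.frame(x = rnorm(100))
dat$y <- 10 + 2 * dat$x + rnorm(100)  # response kept positive for Box-Cox

fit <- lm(y ~ x, data = dat)
dw <- durbin_watson(fit)              # near 2 here, since the errors are iid

# (2) Box-Cox: profile the log-likelihood over lambda; a peak near
# lambda = 1 means no power transformation of the response is indicated.
bc <- boxcox(fit, plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
```

One caveat: Durbin–Watson tests for serial correlation in the order the rows happen to be sorted, so for non-time-series data a "significant" result often reflects how the data are ordered rather than nonlinearity.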

I'm getting really confused about this data. To determine a suitable model for weight, should I be looking at just linear terms, or at linear and interaction terms (remember there are 30 variables, so there are 30·29/2 = 435 pairwise interaction terms)?

When I check which terms are significant in reg2, only 12 predictors and 6 interaction terms seem significant (AIC is lowest with this combination after I run step). Should I just use this reduced model with the deleted variables/interaction terms and run all my tests (e.g. the stepwise method, LASSO, etc.) on it, or should I run them on the full model?

I'm getting quite lost making sense of the steps to find a suitable model for weight using these variables.

My final question: once I have the model, how do I test/prove it's the best (or at least a decent) model?

Any help would really be appreciated.

Best Answer

Neither VIFs nor stepwise selection tell you what is dependent on what. For that, you want condition indices. In R you can get these from the perturb package using the colldiag function.

There, you first look at the condition indices and find those that are high (some suggest > 10, others > 30). Then, for those dimensions, you look at the variables that contribute a large proportion of variance.

EDIT to clarify (from colldiag documentation)

    library(perturb)
    data(consumption)
    # lagged consumption: previous period's cons, with NA in the first slot
    ct1 <- with(consumption, c(NA, cons[-length(cons)]))
    m1 <- lm(cons ~ ct1 + dpi + rate + d_dpi, data = consumption)
    cd <- colldiag(m1)
    cd

Gives


Condition
Index        Variance Decomposition Proportions
           intercept ct1   dpi   rate  d_dpi
1    1.000 0.001     0.000 0.000 0.000 0.002
2    4.143 0.004     0.000 0.000 0.001 0.136
3    7.799 0.310     0.000 0.000 0.013 0.001
4   39.406 0.263     0.005 0.005 0.984 0.048
5  375.614 0.421     0.995 0.995 0.001 0.814
Printing with print(cd, fuzz = .3) blanks out proportions below 0.3, which makes the problem rows easier to read:

Condition
Index        Variance Decomposition Proportions
           intercept ct1   dpi   rate  d_dpi
1    1.000  .         .     .     .     .
2    4.143  .         .     .     .     .
3    7.799 0.310      .     .     .     .
4   39.406  .         .     .    0.984  .
5  375.614 0.421     0.995 0.995  .    0.814

The first column is just an identifier. The second is the condition index. The others are the proportions.

The bottom line shows clearly problematic collinearity (375 is far above 30). So, which variables are contributing? ct1, dpi and d_dpi all have high variance decomposition proportions; all three are involved. You need to do something about this.

The 4th line also has a problematic condition index (39), but only one variable (rate) contributes a large proportion of variance, and collinearity by definition involves at least two variables, so there is not much to do there.
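If it helps the intuition, colldiag's numbers can be reproduced from scratch: the condition indices are ratios of singular values of the column-equilibrated model matrix, and the variance decomposition proportions come from the right singular vectors. Below is a base-R sketch of the Belsley, Kuh and Welsch decomposition (colldiag's defaults may differ in details such as centering):

```r
# Condition indices and variance decomposition proportions,
# computed directly from the SVD of the scaled model matrix.
collin_diag <- function(fit) {
  X <- model.matrix(fit)
  X <- sweep(X, 2, sqrt(colSums(X^2)), "/")   # unit-length columns, no centering
  s <- svd(X)
  index <- max(s$d) / s$d                     # one condition index per dimension
  # var(b_j) decomposes as sum_k v_jk^2 / d_k^2; express each term
  # as a proportion of coefficient j's total.
  phi <- sweep(s$v^2, 2, s$d^2, "/")
  prop <- t(phi / rowSums(phi))               # rows = dimensions, cols = coefficients
  colnames(prop) <- colnames(X)
  list(index = index, proportions = prop)
}
```

A dimension whose index is large, and in whose row two or more variables show big proportions, pins down a collinear set, which is exactly how the table above is read.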