I have run a few tests/methods on my data and am getting contradictory results.
I have a linear model saying:
reg1 = lm(weight ~ height + age + gender (categorical) + several other variables).
If I model each term linearly (i.e. no squared or interaction terms) and run vif(reg1), 4 variables are >15. If I delete the variable with the highest VIF and re-run it, the VIFs change and now only 2 variables are >15. I repeat this until I'm left with 20 variables (out of 30), all below 10. If I use stepwise directly on reg1, it does not delete the 'highest VIF' factor. I don't understand how it tells me 'what' is linearly dependent on 'what variable' and how (and I cannot seem to find this information despite googling for ages).
Furthermore, when I look at the residual plots, most appear horizontal except a few which show an inverted-U curve (none of these have high VIFs). Does this mean a transformation is needed? (I removed outliers, leverage points etc., but now there seem to be more!)
reg2 = lm(weight ~ (height + age + gender (categorical) + several other variables)^2).
If I run vif on this, all of the terms are >500!
What else I have tried (without cutting any variables):
(1) The errors seem correlated when I run diagnostics and check with the Durbin-Watson statistic, indicating the model is not linear… however…
(2) Box-Cox gives lambda = 1, so no transformation is needed.
(3) LASSO gives the lowest Mallows' Cp on the full 30-variable model (i.e. least squares).
(4) Ridge regression gives lambda = 0 which did surprise me.
I'm getting really confused about this data. To determine a suitable model for weight should I be looking just at linear terms or linear and interaction terms (remember there are 25 variables so there are 30^2 interaction terms)?
When I check which ones are significant in reg2, only 12 predictors and 6 interaction terms seem significant (AIC is lowest with this combination after I run step). Should I just use this 'new model with deleted variables/interaction terms' and run all my tests on it (e.g. stepwise, LASSO, etc.), or should I run them on the entire model?
I'm getting quite lost in terms of making sense of steps to find a suitable model for weight using the variables.
My final question: once I have the model, how do I test/prove it's the best (or at least a decent) model?
Any help would really be appreciated.
Best Answer
Neither VIFs nor stepwise selection tell you what is dependent on what. For that, you want condition indices. In R you can get these from the perturb package using the colldiag function. There, you first look for the condition indices that are high (some suggest > 10, others > 30). Then, for those indices, you look at the variables that contribute a large proportion of variance.
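To make those two steps concrete, here is a minimal base-R sketch of the same kind of diagnostics (condition indices plus variance-decomposition proportions). It is not the perturb package's code: the function name `belsley_diag` and the simulated data are assumptions made purely for illustration, and it follows the usual Belsley convention of scaling each column of the model matrix (intercept included) to unit length without centering. With the package installed, `colldiag(fit)` would be the ready-made route.

```r
# Simulated data, purely illustrative: x3 is almost exactly x1 + x2.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.01)
x4 <- rnorm(n)
y  <- 1 + x1 + x2 + x4 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3 + x4)

# Hypothetical helper (not part of any package).
belsley_diag <- function(model) {
  X  <- model.matrix(model)                    # keeps the intercept column
  nm <- colnames(X)
  X  <- sweep(X, 2, sqrt(colSums(X^2)), "/")   # unit-length columns, no centering
  s  <- svd(X)                                 # singular values come sorted, largest first
  cond_index <- max(s$d) / s$d                 # one index per component, increasing
  phi   <- sweep(s$v^2, 2, s$d^2, "/")         # phi[j, k] = v_jk^2 / d_k^2
  props <- t(phi / rowSums(phi))               # rows: components, columns: coefficients
  colnames(props) <- nm
  list(cond_index = cond_index, props = props)
}

diag_out <- belsley_diag(fit)
print(round(cbind(cond_index = diag_out$cond_index, diag_out$props), 3))
```

Each row of the printed table is one component, and each column of proportions sums to 1. A row with a high condition index in which two or more variables have large proportions points to a near-linear dependency among exactly those variables.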
EDIT to clarify (from colldiag documentation)
The first column is just an identifier. The second is the condition index. The others are the proportions.
The bottom line shows clearly problematic collinearity (375 is >> 30). So, which variables are contributing? ct1, dpi and d_dpi all have high variance decompositions; all three are contributing. You need to do something about this.
The 4th line has a problematic condition index (39) but only one variable is contributing much, so there is not much to do.
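The reading rule used in these two cases can be written down as a small sketch. Everything here is an assumption for illustration: the helper name `flag_collinear`, the 0.5 cut-off for a "large" proportion, and the toy table (variables `a` to `d`), whose numbers are made up and are not the ones from the colldiag documentation.

```r
# Hypothetical helper: report components whose condition index exceeds ci_cut
# AND that involve at least two variables with a proportion above prop_cut.
flag_collinear <- function(cond_index, props, ci_cut = 30, prop_cut = 0.5) {
  out <- list()
  for (k in which(cond_index > ci_cut)) {
    involved <- colnames(props)[props[k, ] > prop_cut]
    if (length(involved) >= 2) {       # a single variable cannot be collinear by itself
      out[[as.character(k)]] <- involved
    }
  }
  out
}

# Made-up table: rows are components, columns are variables (each column sums to 1).
cond_index <- c(1, 12, 45, 300)
props <- rbind(
  c(0.00, 0.01, 0.00, 0.02),
  c(0.03, 0.10, 0.01, 0.90),
  c(0.02, 0.85, 0.04, 0.05),
  c(0.95, 0.04, 0.95, 0.03)
)
colnames(props) <- c("a", "b", "c", "d")

flagged <- flag_collinear(cond_index, props)
print(flagged)
```

In the toy table, row 3 has a high index but only one variable loads on it, so it is skipped, while row 4 is reported together with the variables that load on it. The thresholds are arguments precisely because, as noted above, people disagree on where "high" starts.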