Solved – How to deal with high correlation among predictors in multiple regression

correlation, multicollinearity, multiple-regression, partial-correlation

I found a reference in an article that reads:

According to Tabachnick & Fidell (1996) the independent variables with
a bivariate correlation more than .70 should not be included in
multiple regression analysis.

Problem: I used 3 variables correlated > .80 in a multiple regression design, with VIFs at about 4–5 and tolerances at about .2–.3. I cannot exclude any of them (they are theoretically important predictors of the outcome). When I regressed the outcome on the 2 predictors that correlate at .80, both remained significant, each explained important variance, and these same two variables have the largest partial and part (semipartial) correlation coefficients among all 10 variables included (5 of them controls).
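For reference, here is a minimal sketch of the VIF/tolerance check itself, written in Python with statsmodels rather than SPSS (the simulated variables are hypothetical stand-ins, not my actual data):

    # Sketch: VIF and tolerance for each predictor (tolerance = 1/VIF).
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=72)
    x2 = x1 + 0.5 * rng.normal(size=72)   # correlates with x1 at roughly .9
    x3 = rng.normal(size=72)
    X = sm.add_constant(np.column_stack([x1, x2, x3]))

    for i, name in enumerate(["const", "x1", "x2", "x3"]):
        if name == "const":
            continue                       # skip the intercept column
        vif = variance_inflation_factor(X, i)
        print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")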

Question: Is my model valid despite high correlations? Any references greatly welcomed!


Update: Thank you for the answers!

I did not use Tabachnick and Fidell as a guideline; I found this reference in an article dealing with high collinearity amongst predictors.

So, basically, I have too few cases for the number of predictors in the model (many categorical, dummy-coded control variables: age, tenure, gender, etc.): 13 variables for 72 cases. The condition index is ~ 29 with all the controls in and ~ 23 without them (5 variables).
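In case it helps others check this, here is a short sketch of how Belsley-style condition indexes can be computed (Python/numpy; the toy data are hypothetical):

    # Sketch: condition indexes of a design matrix, per Belsley.
    import numpy as np

    def condition_indexes(X):
        X = np.column_stack([np.ones(len(X)), X])  # include the intercept
        X = X / np.linalg.norm(X, axis=0)          # scale columns to unit length
        s = np.linalg.svd(X, compute_uv=False)     # singular values, descending
        return s[0] / s                            # largest value = condition number

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=72)
    X = np.column_stack([x1, x1 + 0.1 * rng.normal(size=72)])  # nearly collinear pair
    print(condition_indexes(X))                    # the last index will be large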

I cannot drop any variable or use factor analysis to combine them, because they make sense theoretically on their own. It is too late to get more data. Since I am conducting the analysis in SPSS, perhaps it would be best to find syntax for ridge regression (although I haven't done this before, and interpreting the results would be new to me).
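To make concrete what ridge regression would do here, this is a minimal sketch of the technique in Python with scikit-learn, not SPSS syntax (the simulated data and the penalty grid are assumptions for the example):

    # Sketch: ridge regression with the penalty chosen by cross-validation.
    import numpy as np
    from sklearn.linear_model import RidgeCV
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=72)
    x2 = x1 + 0.5 * rng.normal(size=72)          # a highly correlated pair
    X = np.column_stack([x1, x2, rng.normal(size=72)])
    y = x1 + x2 + rng.normal(size=72)

    X_std = StandardScaler().fit_transform(X)    # ridge is scale-sensitive
    fit = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X_std, y)
    print("chosen penalty:", fit.alpha_)
    print("standardized coefficients:", fit.coef_)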

If it matters, when I conducted stepwise regression, the same 2 highly correlated variables remained the only significant predictors of the outcome.

And I still do not understand whether the high partial correlations for each of these variables justify keeping them in the model (in case ridge regression cannot be performed).
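For concreteness, this is how the two statistics differ, sketched in Python via residualization (the function names and toy data are made up for illustration):

    # Sketch: partial vs. part (semipartial) correlation.
    import numpy as np

    def _resid(v, Z):
        """Residuals of v after regressing it on Z (with an intercept)."""
        Z = np.column_stack([np.ones(len(v)), Z])
        beta = np.linalg.lstsq(Z, v, rcond=None)[0]
        return v - Z @ beta

    def partial_corr(y, x, Z):
        # remove the controls Z from BOTH x and y, then correlate
        return np.corrcoef(_resid(x, Z), _resid(y, Z))[0, 1]

    def part_corr(y, x, Z):
        # remove the controls Z from x ONLY, then correlate with raw y
        return np.corrcoef(y, _resid(x, Z))[0, 1]

    rng = np.random.default_rng(0)
    Z = rng.normal(size=(72, 2))                 # two hypothetical controls
    x = Z @ np.array([0.5, 0.5]) + rng.normal(size=72)
    y = x + Z @ np.array([1.0, -1.0]) + rng.normal(size=72)
    print(partial_corr(y, x, Z), part_corr(y, x, Z))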

Would you say Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (David A. Belsley, Edwin Kuh and Roy E. Welsch, 1980) would be helpful in understanding multicollinearity? Or might other references be useful?

Best Answer

The key problem is not correlation but collinearity (see the work of Belsley, for instance). This is best assessed using condition indexes (available in R, SAS, and probably other programs as well). Correlation is neither a necessary nor a sufficient condition for collinearity. Condition indexes over 10 (per Belsley) indicate moderate collinearity, over 30 severe; but it also depends on which variables are involved in the collinearity.

If you do find high collinearity, it means that your parameter estimates are unstable. That is, small changes (sometimes in the 4th significant figure) in your data can cause big changes in your parameter estimates (sometimes even reversing their sign). This is a bad thing.
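A quick simulated sketch of that instability (Python; the data are hypothetical, chosen so the two predictors are near-duplicates):

    # Sketch: a tiny change to one observation shifts the OLS coefficients
    # when two predictors are almost perfectly collinear.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 72
    x1 = rng.normal(size=n)
    x2 = x1 + 0.02 * rng.normal(size=n)          # correlation very near 1
    y = x1 + x2 + rng.normal(size=n)

    def ols_coefs(x2_col):
        X = np.column_stack([np.ones(n), x1, x2_col])
        return np.linalg.lstsq(X, y, rcond=None)[0]   # intercept, b1, b2

    x2_perturbed = x2.copy()
    x2_perturbed[0] += 0.01                      # tiny change to one data point
    print("original: ", ols_coefs(x2))
    print("perturbed:", ols_coefs(x2_perturbed))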

Remedies are:

  1. Getting more data
  2. Dropping one variable
  3. Combining the variables (e.g. with partial least squares; see the sketch after this list), and
  4. Performing ridge regression, which gives biased estimates but reduces their variance.
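As a sketch of remedy 3, here is partial least squares collapsing a collinear pair into one latent component (Python/scikit-learn; the simulated data and the choice of a single component are assumptions for the example):

    # Sketch: combining two collinear predictors with PLS.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=72)
    x2 = x1 + 0.2 * rng.normal(size=72)          # highly correlated pair
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(size=72)

    pls = PLSRegression(n_components=1).fit(X, y)  # one latent component
    print("component weights:", pls.x_weights_.ravel())
    print("R^2:", pls.score(X, y))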