Solved – VIF (Variance Inflation Factor) and correlation in linear regression

Tags: correlation, multicollinearity, regression, self-study, variance-inflation-factor

Linear regression: $Y = X_1 + X_2$

Is it possible for $X_1$ to have a low VIF ($1.25$) and, at the same time, a correlation of $0.35$ with $X_2$? If $X_1$ has a correlation of almost 1 with $X_2$, does that imply that the VIF for $X_1$ will be higher?

Best Answer

No. In this particular case with two independent variables it is not possible.

$Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon$

The VIF is calculated in a three-step procedure:

1. Run an OLS regression of $X_1$ on $X_2$:

   $X_1 = c_0 + \alpha X_2 + \epsilon$

2. Calculate the VIF from the $R^2$ of that regression:

   $VIF_i = \frac{1}{1-R^2_i}$

3. Analyze the VIF. What counts as a large VIF? Some people say > 4, some > 10, some > 15.
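The three steps can be sketched in R as follows. This is only an illustration on simulated data: the variable names, the sample size, and the coefficient 0.4 are assumptions made for the example, not part of the question.

```r
# Simulated predictors (hypothetical data for illustration only)
set.seed(1)
n  <- 200
x2 <- rnorm(n)
x1 <- 0.4 * x2 + rnorm(n)   # x1 is moderately related to x2 by construction

# Step 1: auxiliary OLS regression of x1 on x2
aux <- lm(x1 ~ x2)

# Step 2: VIF from the R^2 of the auxiliary regression
r2     <- summary(aux)$r.squared
vif_x1 <- 1 / (1 - r2)
print(vif_x1)

# Step 3: compare against the rule-of-thumb cutoffs mentioned above
cutoffs <- c(4, 10, 15)
print(setNames(vif_x1 > cutoffs, paste0(">", cutoffs)))
```

Which cutoff to use is a judgment call; the code only reports whether each one is exceeded.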

The correlation, in contrast, is computed as follows.

$\rho_{x,y} = corr(x,y) = \frac{cov(x,y)}{\sigma_x \sigma_y} = \frac{E[(x-\mu_x)(y-\mu_y)]}{\sigma_x \sigma_y}$

As a rule of thumb, you should not worry if the correlation is between -0.5 and 0.5. Some people even say that a correlation with an absolute value of up to 0.7 or 0.8 is no major problem.

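Both measures capture only the linear relationship between $X_1$ and $X_2$, so they cannot give completely different pictures. With just two predictors, the $R^2$ from regressing $X_1$ on $X_2$ equals the squared correlation $\rho^2_{X_1,X_2}$, so the VIF is a direct function of the correlation.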
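To make this concrete for the numbers in the question: in a simple regression of $X_1$ on $X_2$ the $R^2$ equals the squared correlation, so in the two-predictor case

$VIF = \frac{1}{1-\rho^2}$

For a correlation of $0.35$:

$VIF = \frac{1}{1-0.35^2} = \frac{1}{0.8775} \approx 1.14$

Conversely, a VIF of $1.25$ corresponds to

$\rho^2 = 1 - \frac{1}{1.25} = 0.2, \quad |\rho| \approx 0.447$

and a correlation close to 1, say $0.99$, gives

$VIF = \frac{1}{1-0.9801} \approx 50.3$

So with only two predictors a VIF of $1.25$ and a correlation of $0.35$ cannot both hold exactly, and a correlation near 1 does imply a large VIF.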

If the correlation and the VIF seem to contradict each other, I propose the following checks.

What happens if you eliminate a variable? Do these regressions yield different results? If yes, the predictors might be correlated.

$Y = \beta_1 X_1 + \epsilon$

$Y = \beta_2 X_2 + \epsilon$

Apply a ridge regression, which is more robust to multicollinearity than an OLS regression. If the results differ, there might be multicollinearity.

Are the variables logically related? For example, if the two variables are people's weight and height, then you already know without a regression that tall people are presumably heavier.
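The first two checks could look roughly like this in R. This is a sketch under assumptions: it uses the built-in mtcars data with wt and disp standing in for $X_1$ and $X_2$, and the ridge penalty lambda = 5 is an arbitrary illustrative value, not a recommendation.

```r
# wt and disp play the roles of X1 and X2 (illustrative choice)
fit_both <- lm(mpg ~ wt + disp, data = mtcars)
fit_wt   <- lm(mpg ~ wt, data = mtcars)
fit_disp <- lm(mpg ~ disp, data = mtcars)

# Check 1: do the slopes change when the other variable is dropped?
coefs <- rbind(
  both      = coef(fit_both)[c("wt", "disp")],
  wt_only   = c(coef(fit_wt)["wt"], disp = NA),
  disp_only = c(wt = NA, coef(fit_disp)["disp"])
)
print(coefs)

# Check 2: ridge regression via MASS::lm.ridge (lambda chosen arbitrarily)
ridge <- MASS::lm.ridge(mpg ~ wt + disp, data = mtcars, lambda = 5)
print(coef(ridge))
```

Large shifts in the slopes between rows of `coefs`, or between the OLS and ridge coefficients, are the kind of difference the checks above refer to.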

Related Solutions

Solved – How to calculate the variance inflation factor for a categorical predictor variable when examining multicollinearity in a linear regression model

The function you requested comes in the package {car} in R.

I tried to figure it out by running some regression models on the mtcars dataset in R.

Evidently, I can get the VIF both using the function and manually, when the regressor is a continuous variable:

library(car)      # provides vif()
attach(mtcars)    # lets the formulas below refer to mtcars columns directly

fit1 <- lm(mpg ~ wt + hp + disp)      # The model we want.
fit_wt <- lm(wt ~ hp + disp)          # Regress wt on the other regressors.
rsq_wt <- summary(fit_wt)$r.squared   # R squared of the auxiliary model
(v_wt <- 1/(1 - rsq_wt))              # VIF formula, computed manually
vif(fit1)                             # vif() from the car package

Now for the real question, here is what I find. Let's say that your regressor is am, which corresponds to the categorical variable for the type of transmission of the car (automatic versus manual).

Ordinarily, you would fit a model such as:

fit2 <- lm(mpg ~ wt + disp + as.factor(am))

The problem is that if you now try to get the VIF for am by just reshuffling the regressors, you get warning messages:

fit_am <- lm(as.factor(am) ~ wt + disp)
Warning messages:
1: In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : - not meaningful for factors

Game over? Not quite... Look what happens if I treat am as continuous:

> fit2 <- lm(mpg ~ wt + disp + as.factor(am))
> fit_am <- lm(am ~ wt + disp)
> rsq_am <- summary(fit_am)$r.squared
> (v_am <- 1/(1 - (rsq_am)))
[1] 1.931264
> vif(fit2)
           wt          disp as.factor(am) 
     5.939675      4.752561      1.931264

We get the same value manually as with the vif function from {car}.
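The manual calculation can be wrapped in a small helper that does not depend on {car}. The function name `manual_vif` is hypothetical, and the sketch assumes every predictor is numeric or a 0/1 dummy such as am. For a factor with more than two levels, car::vif reports a generalized VIF instead, which this helper does not reproduce.

```r
# Hypothetical helper: VIF of each predictor from its auxiliary regression
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on all remaining predictors
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

v <- manual_vif(mtcars, c("wt", "disp", "am"))
print(round(v, 6))
```

Passing `data = mtcars` explicitly avoids the need for attach().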

Correlation and Multicollinearity – VIF vs Correlation Explained

First, I think it is better to use condition indexes rather than VIF to diagnose collinearity. See the work of David Belsley or even (if you want a soporific) my dissertation (that link seems to have vanished; this one should work, I hope).
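As a rough sketch of what condition indexes look like in practice, the following computes them from the singular values of the column-scaled model matrix. The model (mpg on wt, hp, and disp from mtcars) is an assumption chosen for illustration, and the scaling follows my understanding of Belsley's convention (unit-length columns, intercept included, no centering).

```r
# Model matrix including the intercept column (illustrative model)
X <- model.matrix(mpg ~ wt + hp + disp, data = mtcars)

# Scale each column to unit Euclidean length, without centering
X_scaled <- sweep(X, 2, sqrt(colSums(X^2)), "/")

# Condition indexes: largest singular value divided by each singular value
d <- svd(X_scaled)$d
cond_index <- max(d) / d
print(round(cond_index, 2))
```

The smallest index is 1 by construction; larger values point to stronger near-dependencies among the columns.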

However, to get to your question: It is possible to have low correlations among all variables but perfect collinearity. If you have 11 independent variables, 10 of which are mutually independent and the 11th is the sum of the other 10, then no pairwise correlation will be large (with equal variances, the sum correlates about 0.32 with each component, i.e. a squared correlation of about 0.1), but collinearity is perfect. So, high VIF does not imply high correlations.
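A small simulation can illustrate the 11-variable example. The sample size, the seed, and the use of standard normal draws are assumptions made for this sketch.

```r
set.seed(42)
n <- 1000

# Ten mutually independent predictors plus an 11th that is their sum
X10 <- matrix(rnorm(n * 10), nrow = n)
x11 <- rowSums(X10)
X   <- cbind(X10, x11)

# Largest absolute pairwise correlation (off-diagonal entries only)
cors    <- cor(X)
max_cor <- max(abs(cors[upper.tri(cors)]))
print(max_cor)

# R^2 from regressing the 11th variable on the other ten, computed by hand
aux <- lm(x11 ~ X10)
r2  <- 1 - sum(resid(aux)^2) / sum((x11 - mean(x11))^2)
print(r2)   # numerically 1, so 1 / (1 - r2) is effectively infinite
```

All pairwise correlations stay modest, yet the auxiliary $R^2$ is 1, which is exactly the situation described above.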

It is also true that you can have pretty high correlations without it creating troublesome collinearity, but this is trickier to show. See the references.
