Solved – How to calculate the variance inflation factor for a categorical predictor variable when examining multicollinearity in a linear regression model

categorical data, multicollinearity, multiple regression, variance-inflation-factor

For example, if we have the linear regression model:

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 $$

where $ x_1 =\begin{cases} 1 & \mbox{if level 2} \\ 0 & \mbox{otherwise} \end{cases}$ and $x_2, x_3 $ are quantitative.

When checking for multicollinearity, we typically compute the linear regression models for each independent variable as a function of the remaining independent variables:

\begin{align}
E[x_1] &= \alpha_0 + \alpha_2 x_2 + \alpha_3 x_3 \\
E[x_2] &= \alpha_0 + \alpha_1 x_1 + \alpha_3 x_3 \\
E[x_3] &= \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2
\end{align}

And from there, we derive the VIF for each term from $R^{2}_{i}\, \forall i \in \{1, 2, 3\}$ (just the $R^2$ of the models above).

My problem is with $ E[x_1]$ which is possibly non-sensical, or at least counter-intuitive, as the possible values of $ x_1 $ are only $0$ or $1$. Of course, computing $ E[x_1]$ does make sense mathematically, but is this exactly what we need? Please explain.

Is there a way to calculate the $VIF$ for a categorical variable to check for how it is affected by multicollinearity?

(And an extra thank you if you know of a library in R which calculates the VIF for a linear regression model. Maybe the vif function?)

Best Answer

The function you requested comes in the package {car} in R.

I tried to figure it out by running some regression models on the mtcars dataset in R.

Evidently, I can get the VIF both using the function and manually, when the regressor is a continuous variable:

library(car)                         # Provides vif()
attach(mtcars)                       # Makes mpg, wt, hp, disp, am visible by name

fit1 <- lm(mpg ~ wt + hp + disp)     # The model we want.
fit_wt <- lm(wt ~ hp + disp)         # Regressing wt against the other regressors.
rsq_wt <- summary(fit_wt)$r.squared  # R-squared of that auxiliary model
(v_wt <- 1/(1 - (rsq_wt)))           # Actual formula for VIF
vif(fit1)                            # vif() from {car}
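As a cross-check that does not depend on {car}, the same numbers can be obtained in base R in two ways: by looping the auxiliary regressions over every regressor, or from the diagonal of the inverse correlation matrix of the regressors. This is only a sketch; the names X, vif_loop and vif_inv are made up for illustration.

```r
# Regressor columns of the model mpg ~ wt + hp + disp (intercept dropped).
X <- model.matrix(mpg ~ wt + hp + disp, data = mtcars)[, -1]

# Way 1: regress each column on all the others and apply 1 / (1 - R^2).
vif_loop <- sapply(colnames(X), function(v) {
  others <- X[, setdiff(colnames(X), v), drop = FALSE]
  aux <- lm(X[, v] ~ others)
  1 / (1 - summary(aux)$r.squared)
})

# Way 2: the VIFs are the diagonal of the inverse correlation matrix.
vif_inv <- diag(solve(cor(X)))

print(round(cbind(vif_loop, vif_inv), 4))
```

Both columns of the printed table should agree, and they should match what vif(fit1) reports.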

Now for the real question, here is what I find. Let's say that your regressor is am, which corresponds to the categorical variable for the type of transmission of the car (automatic versus manual).

Ordinarily, you would fit a model such as:

fit2 <- lm(mpg ~ wt + disp + as.factor(am))

The problem is that if you now try to get the VIF for am by simply making the factor the response, R issues warnings instead of giving a meaningful fit:

fit_am <- lm(as.factor(am) ~ wt + disp)
Warning messages:
1: In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : - not meaningful for factors

Game over? Not quite... Look what happens if I treat am as continuous:

> fit2 <- lm(mpg ~ wt + disp + as.factor(am))
> fit_am <- lm(am ~ wt + disp)
> rsq_am <- summary(fit_am)$r.squared
> (v_am <- 1/(1 - (rsq_am)))
[1] 1.931264
> vif(fit2)
           wt          disp as.factor(am) 
     5.939675      4.752561      1.931264

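We get the same value manually as with the vif function from {car}. This works because am has only two levels, so as.factor(am) enters the model as a single 0/1 dummy column, which is exactly the numeric am regressed above. The auxiliary regression is used only for its $R^2$, so it does not matter that its fitted values are not restricted to 0 and 1.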
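A factor with more than two levels contributes several dummy columns, so a single auxiliary regression no longer applies. In that case vif() from {car} reports a generalized VIF (GVIF) per term, along with GVIF^(1/(2*Df)). The sketch below reproduces the determinant formula behind the GVIF in base R; gvif_term is a hypothetical helper written for this illustration, and using cyl as the multi-level factor is an assumption, not part of the original question.

```r
# Generalized VIF of one model term, using
#   GVIF = det(R11) * det(R22) / det(R)
# where R is the correlation matrix of all regressor columns, R11 the block
# for the term's own columns and R22 the block for all other columns.
gvif_term <- function(formula, data, term) {
  mm <- model.matrix(formula, data = data)
  term_id <- attr(mm, "assign")                  # term index of each column
  labels <- attr(terms(formula), "term.labels")  # term names, in order
  keep <- term_id > 0                            # drop the intercept column
  R <- cor(mm[, keep, drop = FALSE])
  own <- term_id[keep] == match(term, labels)
  det(R[own, own, drop = FALSE]) * det(R[!own, !own, drop = FALSE]) / det(R)
}

# Binary factor: one dummy column, so the GVIF is the ordinary VIF
# (the answer above reports 1.931264 for am).
g_am <- gvif_term(mpg ~ wt + disp + as.factor(am), mtcars, "as.factor(am)")

# Three-level factor: two dummy columns, so Df = 2.
g_cyl <- gvif_term(mpg ~ wt + disp + factor(cyl), mtcars, "factor(cyl)")
g_cyl_adj <- g_cyl^(1 / (2 * 2))   # GVIF^(1/(2*Df)), comparable to sqrt(VIF)

print(c(am = g_am, cyl = g_cyl, cyl_adjusted = g_cyl_adj))
```

For a one-column term the block R11 is just 1, so the expression collapses to the usual 1/(1 - R^2).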

Related Solutions

Solved – VIF (Variance Inflation Factor) and correlation in linear regression

No. In this particular case with two independent variables it is not possible.

$Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon$

The VIF is calculated in a three-step procedure:

1. Run an OLS regression of $X_1$ on $X_2$:

$X_1 = c_0 + \alpha X_2 + \epsilon$

2. Calculate the VIF from the $R^2$ of that regression:

$VIF_i = \frac{1}{1-R^2_{i}}$

3. Analyze the VIF. What counts as a large VIF? Some people say >4, some >10, some >15.
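The three steps can be sketched in R as follows. The choice of wt and disp from mtcars as the two predictors is an assumption made only for illustration, and the cut-offs are simply the rules of thumb quoted above.

```r
# Step 1: regress one predictor on the other (wt on disp, from mtcars).
aux <- lm(wt ~ disp, data = mtcars)

# Step 2: turn the R^2 of that auxiliary regression into a VIF.
r2 <- summary(aux)$r.squared
vif_two <- 1 / (1 - r2)

# Step 3: compare against the commonly quoted rules of thumb.
cutoffs <- c(4, 10, 15)
exceeds <- vif_two > cutoffs
names(exceeds) <- paste0(">", cutoffs)

print(vif_two)
print(exceeds)
```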

The correlation, in contrast, is computed in the following way:

$\rho_{X,Y} = corr(X,Y) = \frac{cov(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$

You should not worry if the correlation is between -0.5 and 0.5. Some people even say that a correlation of up to 0.7 or 0.8 in absolute value is no major problem.

You should see that both measures only represent a linear relationship between $X_1$ and $X_2$. So they cannot yield completely different measures.
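To make the link explicit: with exactly two predictors the auxiliary regression has a single regressor, so its $R^2$ is the squared correlation between $X_1$ and $X_2$, and the VIF is a function of the correlation alone:

$$VIF = \frac{1}{1-R^2} = \frac{1}{1-\rho_{X_1,X_2}^2}$$

Plugging in the rules of thumb mentioned above, $|\rho| = 0.5$ gives $VIF \approx 1.33$ and $|\rho| = 0.8$ gives $VIF \approx 2.78$, while $VIF = 4$ corresponds to $|\rho| \approx 0.87$ and $VIF = 10$ to $|\rho| \approx 0.95$. This identity holds only in the two-predictor case; with more regressors the VIF can be large even when every pairwise correlation is moderate.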

If the correlation and the VIF are somewhat contradictory, I propose the following procedures.

What if you eliminate a variable? Do these regressions yield different results? If yes, there might be correlation.

$Y = \beta_1 X_1 + \epsilon$

$Y = \beta_2 X_2 + \epsilon$

Apply a ridge regression, which is more robust to multicollinearity than an OLS regression. If the results differ, there might be multicollinearity.

Are the variables logically related? E.g., if the two variables are the weight and height of people, then you already know without a regression that tall people are presumably heavier.
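The first two checks can be sketched in base R. Everything here is illustrative: the predictors wt and disp from mtcars, the closed-form ridge helper, and the penalty lambda = 10 are assumptions rather than recommendations.

```r
# Illustrative data: two correlated predictors from mtcars.
X <- scale(as.matrix(mtcars[, c("wt", "disp")]))  # standardized predictors
y <- mtcars$mpg - mean(mtcars$mpg)                # centered response

# Check 1: fit each predictor alone and compare with the joint fit.
b_joint <- coef(lm(y ~ X - 1))
b_alone <- c(coef(lm(y ~ X[, "wt"] - 1)), coef(lm(y ~ X[, "disp"] - 1)))

# Check 2: ridge estimate (X'X + lambda * I)^{-1} X'y; lambda = 0 is OLS.
ridge <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}
b_ols <- ridge(X, y, 0)
b_ridge <- ridge(X, y, 10)   # lambda = 10 is an arbitrary illustrative value

print(rbind(joint = unname(b_joint), alone = unname(b_alone), ridge = unname(b_ridge)))
```

Large shifts between the rows are a hint that the two predictors share much of the same information.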

Solved – How to test whether $\beta_1= \beta_3 = 0.5$ using R (without using offset function)

First, create the model:

library(data.table)   # provides fread()
data = fread(paste0("http://www1.aucegypt.edu/faculty/hadi/RABE5/Data5/", "P060.txt"))
model <- lm(data = data, Y ~ X1 + X3)

Then you can use the following code:

library(car)
linearHypothesis(model, c("X1=X3", "X1=0.5"))

You will get the same output with less code and hassle.
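For intuition about what such a joint test computes, here is a base-R sketch of the F test comparing the unrestricted model with the model that imposes both slopes equal to 0.5. It uses simulated data rather than the P060 file, and all variable names are chosen for illustration. The statistic should correspond to the F test that linearHypothesis reports for the same hypothesis, though that comparison is not run here.

```r
set.seed(1)                                   # simulated data, illustration only
n <- 100
d <- data.frame(X1 = rnorm(n), X3 = rnorm(n))
d$Y <- 1 + 0.5 * d$X1 + 0.5 * d$X3 + rnorm(n)

full <- lm(Y ~ X1 + X3, data = d)             # unrestricted model

# Under H0 both slopes are 0.5, so move those terms to the left-hand side
# and estimate only the intercept.
restricted <- lm(I(Y - 0.5 * X1 - 0.5 * X3) ~ 1, data = d)

rss_full <- sum(resid(full)^2)
rss_restricted <- sum(resid(restricted)^2)
q <- 2                                        # number of restrictions tested
f_stat <- ((rss_restricted - rss_full) / q) / (rss_full / df.residual(full))
p_value <- pf(f_stat, q, df.residual(full), lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```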
