How to calculate the variance inflation factor for a categorical predictor variable when examining multicollinearity in a linear regression model


For example, if we have the linear regression model:

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 $$

where $ x_1 =\begin{cases} 1 & \mbox{if level 2} \\ 0 & \mbox{otherwise} \end{cases}$ and $x_2, x_3 $ are quantitative.

When checking for multicollinearity, we typically compute the linear regression models for each independent variable as a function of the remaining independent variables:

\begin{align}
E[x_1] &= \alpha_0 + \alpha_2 x_2 + \alpha_3 x_3 \\
E[x_2] &= \alpha_0 + \alpha_1 x_1 + \alpha_3 x_3 \\
E[x_3] &= \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2
\end{align}

From there, we derive the VIF for each term from $R^{2}_{i}$, $i \in \{1, 2, 3\}$ (the $R^2$ of the corresponding model above).
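For reference, the standard relationship between each auxiliary $R^2_i$ and the VIF can be written out as follows. The numerical values are illustrative only and are not taken from any data set:

$$\mathrm{VIF}_i = \frac{1}{1 - R^{2}_{i}}, \qquad \text{e.g. } R^{2}_{i} = 0.8 \;\Rightarrow\; \mathrm{VIF}_i = \frac{1}{1 - 0.8} = 5, \qquad R^{2}_{i} = 0.9 \;\Rightarrow\; \mathrm{VIF}_i = 10 .$$

A VIF of 1 corresponds to $R^{2}_{i} = 0$, that is, a predictor that is linearly unrelated to the others.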

My problem is with $ E[x_1]$, which is possibly nonsensical, or at least counter-intuitive, since the only possible values of $ x_1 $ are $0$ and $1$. Of course, computing $ E[x_1]$ does make sense mathematically, but is this exactly what we need? Please explain.

Is there a way to calculate the $VIF$ for a categorical variable to check for how it is affected by multicollinearity?

(And an extra thank you if you know of a library in R that calculates the VIF for a linear regression model. Maybe the vif function?)

Best Answer

The function you asked about, vif, is in the R package {car}.

I tried to work this out by fitting some regression models to the mtcars dataset that ships with R.

Evidently, I can get the VIF both using the function and manually, when the regressor is a continuous variable:

library(car)                         # Provides vif()
attach(mtcars)

fit1 <- lm(mpg ~ wt + hp + disp)     # The model we want.
fit_wt <- lm(wt ~ hp + disp)         # Regress wt on the other regressors.
rsq_wt <- summary(fit_wt)$r.squared  # R-squared of the auxiliary model
(v_wt <- 1/(1 - (rsq_wt)))           # VIF formula: 1 / (1 - R^2)
vif(fit1)                            # Same quantity from car::vif
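The manual step above can be repeated for every regressor with a small helper. This is only a sketch under stated assumptions: `manual_vif` is a hypothetical name, not a function from {car}, and it expects a data frame whose columns are all numeric predictors.

```r
# Hypothetical helper (not part of car): VIF for each column of a
# data frame of numeric predictors, via one auxiliary regression per column.
manual_vif <- function(X) {
  sapply(names(X), function(v) {
    # Regress predictor v on all the remaining predictors
    aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
    1 / (1 - summary(aux)$r.squared)
  })
}

X <- mtcars[, c("wt", "hp", "disp")]
manual_vif(X)
```

The result is a named vector with one VIF per column, which can be compared directly with the output of vif(fit1).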

Now for the real question, here is what I find. Let's say that your regressor is am, which corresponds to the categorical variable for the type of transmission of the car (automatic versus manual).

Ordinarily, you would fit a model such as:

fit2 <- lm(mpg ~ wt + disp + as.factor(am))

The problem is that if you now try to get the VIF for am by simply moving the factor to the left-hand side, lm issues warnings, because a factor cannot be used as a numeric response:

fit_am <- lm(as.factor(am) ~ wt + disp)
Warning messages:
1: In model.response(mf, "numeric") :
  using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : - not meaningful for factors

Game over? Not quite. Look what happens if I put am on the left-hand side as a numeric 0/1 variable instead of a factor:

> fit2 <- lm(mpg ~ wt + disp + as.factor(am))
> fit_am <- lm(am ~ wt + disp)
> rsq_am <- summary(fit_am)$r.squared
> (v_am <- 1/(1 - (rsq_am)))
[1] 1.931264
> vif(fit2)
           wt          disp as.factor(am) 
     5.939675      4.752561      1.931264 

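The manual calculation gives the same value as the vif function from {car}. For a two-level factor, the VIF is simply the VIF of its 0/1 dummy variable, so the auxiliary regression only serves to measure how well the other regressors linearly predict that dummy.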
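One way to see that nothing special is needed for the 0/1 dummy: the VIFs of a set of regressors are the diagonal elements of the inverse of their correlation matrix, so only linear association matters, not whether a column is binary. The following is a minimal sketch in base R, assuming am is kept as the numeric 0/1 column it already is in mtcars (`vif_from_cor` is just an illustrative name):

```r
# VIFs as the diagonal of the inverse correlation matrix of the regressors.
# am is used as a plain 0/1 numeric column, exactly like the dummy x_1.
X <- mtcars[, c("wt", "disp", "am")]
vif_from_cor <- diag(solve(cor(X)))
print(vif_from_cor)
```

These values should agree with the vif(fit2) output above. Note that this shortcut covers a factor with two levels only. For a factor with more than two levels, which enters the model as several dummy columns, vif in {car} reports a generalized VIF (GVIF) for the factor as a whole rather than one VIF per dummy.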
