For example, if we have the linear regression model:
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 $$
where $ x_1 =\begin{cases} 1 & \mbox{if level 2} \\ 0 & \mbox{otherwise} \end{cases}$ and $x_2, x_3 $ are quantitative.
When checking for multicollinearity, we typically regress each independent variable on the remaining independent variables:
\begin{align}
E[x_1] &= \alpha_0 + \alpha_2 x_2 + \alpha_3 x_3 \\
E[x_2] &= \alpha_0 + \alpha_1 x_1 + \alpha_3 x_3 \\
E[x_3] &= \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2
\end{align}
And from there, we derive the VIF for each term from $R^{2}_{i}$ for $i \in \{1, 2, 3\}$ (just the $R^2$ of the corresponding model above).
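For reference, the standard definition (not written out above) that links each auxiliary $R^2_i$ to the VIF is:

$$ VIF_i = \frac{1}{1 - R^{2}_{i}}, \qquad i \in \{1, 2, 3\} $$

so $R^{2}_{i} = 0$ gives $VIF_i = 1$, and $VIF_i$ grows without bound as $R^{2}_{i} \to 1$.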
My problem is with $E[x_1]$, which is possibly nonsensical, or at least counter-intuitive, since the only possible values of $x_1$ are $0$ and $1$. Of course, computing $E[x_1]$ does make sense mathematically, but is this exactly what we need? Please explain.
Is there a way to calculate the VIF for a categorical variable, to check how it is affected by multicollinearity?
(And an extra thank you if you know of a library in R which calculates the VIF for a linear regression model. Maybe the `vif` function?)
Best Answer
The function you requested comes in the package `{car}` in R.

I tried to figure it out by running some regression models on the `mtcars` dataset in R. Evidently, I can get the VIF both with the function and manually when the regressor is a continuous variable.

Now for the real question, here is what I find. Let's say that your regressor is `am`, which corresponds to the categorical variable for the type of transmission of the car (automatic versus manual). Ordinarily, you would fit a model with `am` entered as a categorical regressor.

The problem is that if you now try to get the VIF for `am` by just reshuffling the regressors, so that `am` becomes the response, you get an error message.

Game over? Not quite... Look what happens if I treat `am` as continuous: we get the same value manually as with the `vif` function from `{car}`.
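The steps above can be sketched as follows. This is a minimal illustration, not necessarily the models used originally: the choice of `wt` and `disp` as the other regressors and the helper `manual_vif` are assumptions made for the example. Only base R is needed for the manual part; the comparison with `car::vif` runs only if `{car}` is installed.

```r
# Illustrative sketch: the regressors wt and disp are an assumed choice.
data(mtcars)

# Hypothetical helper: VIF of one regressor, computed from the R^2 of the
# auxiliary regression of that regressor on the remaining regressors.
manual_vif <- function(formula, data) {
  r2 <- summary(lm(formula, data = data))$r.squared
  1 / (1 - r2)
}

# Main model, with am kept as the 0/1 numeric column stored in mtcars.
fit <- lm(mpg ~ wt + disp + am, data = mtcars)

# Continuous regressor: auxiliary regression of wt on disp and am.
vif_wt <- manual_vif(wt ~ disp + am, mtcars)

# Binary regressor treated as numeric 0/1: auxiliary regression of am.
vif_am <- manual_vif(am ~ wt + disp, mtcars)

# Same auxiliary regression with am stored as a factor: lm() signals a
# condition (warning or error) instead of giving a usable R^2. The message
# is captured here rather than stopping the script.
mtcars_f <- transform(mtcars, am = factor(am))
factor_attempt <- tryCatch(
  manual_vif(am ~ wt + disp, mtcars_f),
  error   = function(e) conditionMessage(e),
  warning = function(w) conditionMessage(w)
)

print(c(wt = vif_wt, am = vif_am))
print(factor_attempt)

# Optional comparison with the packaged function, if {car} is available.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(fit))
}
```

Because `am` has only two levels, its single 0/1 dummy column is exactly what enters the design matrix, so regressing that column on the other predictors gives the $R^2$ that the VIF formula needs, even though the fitted values of that auxiliary regression are not meant to be interpreted as predictions.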