Solved – How to interpret an intercept VIF

multicollinearityregressionvariance-inflation-factor

I ran a multiple regression with six independent variables (A-F) and an Intercept. None of the independent variables has a VIF to worry about but the Intercept VIF is way too large:

     **VIF**   **Variable**
0   27.689838  Intercept
1    1.094412   A
2    1.203059   B
3    1.271636   C
4    1.182821   D
5    1.000848   E
6    1.240961   F

How do I interpret such a result? How can the intercept be correlated with another variable and which one should it be, considering that all the other VIFs are perfectly fine? Is such an outcome a showstopper or is there a way to work around it; dropping the intercept certainly isn't…

Best Answer

This shouldn't be a showstopper, vifs seem to be given to much importance anyway. And yes, you are right that the intercept should not be dropped!

You did not give enough context to say much more, but your variable names indicate maybe that your variables are dummys for a factor. Maybe the reference level chosen (that is, the one dummy left out of the model) has very few observations? Or some other particularity of the data. You didn't tell us how you calculated the vif. Notable, the vif() function in R's car package will not calculate vif for the intercept.

Related Solutions

Collinearity Diagnostics – Issues with Interaction Terms

Yes, this is usually the case with non-centered interactions. A quick look at what happens to the correlation of two independent variables and their "interaction"

set.seed(12345)
a = rnorm(10000,20,2)
b = rnorm(10000,10,2)
cor(a,b)
cor(a,a*b)

> cor(a,b)
[1] 0.01564907
> cor(a,a*b)
[1] 0.4608877

And then when you center them:

c = a - 20
d = b - 10
cor(c,d)
cor(c,c*d)

> cor(c,d)
[1] 0.01564907
> cor(c,c*d)
[1] 0.001908758

Incidentally, the same can happen with including polynomial terms (i.e., $X,~X^2,~...$) without first centering.

So you can give that a shot with your pair.

As to why centering helps - but let's go back to the definition of covariance

\begin{align} \text{Cov}(X,XY) &= E[(X-E(X))(XY-E(XY))] \\ &= E[(X-\mu_x)(XY-\mu_{xy})] \\ &= E[X^2Y-X\mu_{xy}-XY\mu_x+\mu_x\mu_{xy}] \\ &= E[X^2Y]-E[X]\mu_{xy}-E[XY]\mu_x+\mu_x\mu_{xy} \\ \end{align}

Even given independence of X and Y

\begin{align} \qquad\qquad\qquad\, &= E[X^2]E[Y]-\mu_x\mu_x\mu_y-\mu_x\mu_y\mu_x+\mu_x\mu_x\mu_y \\ &= (\sigma_x^2+\mu_x^2)\mu_y-\mu_x^2\mu_y \\ &= \sigma_x^2\mu_y \\ \end{align}

This doesn't related directly to your regression problem, since you probably don't have completely independent $X$ and $Y$, and since correlation between two explanatory variables doesn't always result in multicollinearity issues in regression. But it does show how an interaction between two non-centered independent variables causes correlation to show up, and that correlation could cause multicollinearity issues.

Intuitively to me, having non-centered variables interact simply means that when $X$ is big, then $XY$ is also going to be bigger on an absolute scale irrespective of $Y$, and so $X$ and $XY$ will end up correlated, and similarly for $Y$.

Solved – How to interpret a VIF of 4

When you estimate a regression equation $y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$, where in your case $y$ is the election result, $x_1$ is personal income and $x_2$ is presidential popularity, then, when the 'usual' assumptions are fullfilled, the estimated coefficients $\hat{\beta}_i$ are random variables (i.e. with another sample you will get other estimates) that have a normal distribution with mean the 'true' but unknown $\beta_i$ and a standard deviation that can be computed from the sample. i.e. $\hat{\beta}_i \sim N(\beta_i;\sigma_{\hat{\beta}_i})$. (I am assuming here that the standard deviation of the error term $\epsilon$ is known, the reasoning does not change when it is unknown but the normal distribution is no longer applicable then, and one should use the t-distribution).

If one wants to test whether a coefficient $\beta_i$ is significant, then one performs the statistical hypothesis test $H_0: \beta_i=0$ versus $H_1: \beta_i \ne 0$.

If $H_0$ is true, then the estimator $\hat{\beta}_i$ follows (see supra) a normal distribution with mean 0 and the standard deviation as supra, i.e. if $H_0$ is true then $\hat{\beta}_i \sim N(0;\sigma_{\hat{\beta}_i})$.

The value for $\bar{\beta}_i$ that we compute from our sample comes from this distribution, therefore $\frac{|\bar{\beta}_i - 0|}{\sigma_{\hat{\beta}_i}}$ is an outcome of a standard normal random variable. So for a significance level $\alpha$ we will reject the $H_0$ whenever $\frac{|\bar{\beta}_i | }{\sigma_{\hat{\beta}_i}} \ge z_{\frac{\alpha}{2}}$

If there is correlation between your independent variables $x_1$ and $x_2$ then it can be shown that $\sigma_{\hat{\beta}_i}$ will be larger than when $x_1$ and $x_2$ are uncorrelated. Therefore, if $x_1$ and $x_2$ are correlated the null hypothesis will be 'more difficult to reject' because of the higher denominator.

The Variance Inflating Factor (VIF) tells you how much higher the variance $\sigma_{\hat{\beta}_i}$ are when $x_1$ and $x_2$ are correlated compared to when they are uncorrelated. In your case, the variance is higher by a factor four.

High VIFs are a sign of multicollinearity.

EDIT: added because of the question in your comment:

If you want it in simple words, but less precise, then I think that you have some correlation between the two independent variables personal income ($x_1$) an president's popularity ($x_2$) (but you also have as you say a limited sample). Can you compute their correlation ?

If $x_1$ and $x_2$ are strongly correlated then that means that they 'move together'. What linear regression tries to do is to ''assign'' a change in the dependent variable $y$ to either $x_1$ or $x_2$. Obviously, if both 'move together' (because of high correlation) then it will be difficult to 'decide' which of the $x$'s is 'responsible' for the change in $y$ (because they both change). Therefore the estimates of the $\beta_i$ coefficients will be less precise.

A VIF of four means that the variance (a measure of imprecision) of the estimated coefficients is four times higher because of correlation between the two independent variables.

If your goal is to predict the election results, then multicollinearity is not necessarily a problem, if you want to analyse the impact of e.g. the personal income on the results, then there may be a problem because the estimates of the coefficients are imprecise (i.e. if you would estimate them with another sample then they may change a lot).

Best Answer

Related Solutions

Collinearity Diagnostics – Issues with Interaction Terms

Solved – How to interpret a VIF of 4

EDIT: added because of the question in your comment:

Related Question