Solved – How to interpret a VIF of 4

multicollinearity, multiple regression, p-value, statistical significance, variance-inflation-factor

I am doing a multiple regression, trying to test the extent to which personal income changes and presidential popularity can predict election results. Unfortunately I have a small sample size, as the country I am studying only has data for the last 11 elections. Each of my independent variables is, on its own, significantly correlated with election results, with p-values < .05. However, when I fit a multiple regression with both variables, neither is significant anymore. I assumed this was due to multicollinearity, but when I asked SPSS for the VIFs, both were around 4. How should I interpret this?

Best Answer

When you estimate a regression equation $y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$, where in your case $y$ is the election result, $x_1$ is personal income and $x_2$ is presidential popularity, then, when the 'usual' assumptions are fulfilled, the estimated coefficients $\hat{\beta}_i$ are random variables (i.e. with another sample you would get other estimates) that follow a normal distribution with mean the 'true' but unknown $\beta_i$ and a standard deviation that can be computed from the sample, i.e. $\hat{\beta}_i \sim N(\beta_i;\sigma_{\hat{\beta}_i})$. (I am assuming here that the standard deviation of the error term $\epsilon$ is known; the reasoning does not change when it is unknown, but then the normal distribution no longer applies and one should use the t-distribution.)

If one wants to test whether a coefficient $\beta_i$ is significant, then one performs the statistical hypothesis test $H_0: \beta_i=0$ versus $H_1: \beta_i \ne 0$.

If $H_0$ is true, then the estimator $\hat{\beta}_i$ follows (see above) a normal distribution with mean 0 and the standard deviation given above, i.e. if $H_0$ is true then $\hat{\beta}_i \sim N(0;\sigma_{\hat{\beta}_i})$.

The value of $\hat{\beta}_i$ that we compute from our sample comes from this distribution, so $\frac{|\hat{\beta}_i - 0|}{\sigma_{\hat{\beta}_i}}$ is an outcome of a standard normal random variable. Hence, for a significance level $\alpha$, we reject $H_0$ whenever $\frac{|\hat{\beta}_i|}{\sigma_{\hat{\beta}_i}} \ge z_{\frac{\alpha}{2}}$.
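As a concrete illustration, here is a minimal Python sketch of this test; the estimate and its standard error are made-up numbers for illustration, not values from your data:

```python
from scipy import stats

# Made-up estimate and standard error, purely for illustration
beta_hat = 1.8   # estimated coefficient beta_i-hat
se_beta = 1.1    # its standard deviation sigma_beta_i-hat

# Test statistic for H0: beta_i = 0
z = abs(beta_hat - 0) / se_beta

# Two-sided p-value under the standard normal
p_value = 2 * (1 - stats.norm.cdf(z))

alpha = 0.05
print(f"z = {z:.3f}, p = {p_value:.3f}")
print("reject H0" if z >= stats.norm.ppf(1 - alpha / 2) else "fail to reject H0")
```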

If there is correlation between your independent variables $x_1$ and $x_2$, then it can be shown that $\sigma_{\hat{\beta}_i}$ will be larger than when $x_1$ and $x_2$ are uncorrelated. Therefore, if $x_1$ and $x_2$ are correlated, the null hypothesis will be 'more difficult to reject' because of the larger denominator.
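To make the mechanism explicit: in the two-predictor case it can be shown that

$$\sigma^2_{\hat{\beta}_1} = \frac{\sigma^2_{\epsilon}}{(1-r_{12}^2)\sum_j (x_{1j}-\bar{x}_1)^2},$$

where $r_{12}$ is the sample correlation between $x_1$ and $x_2$. The factor $\frac{1}{1-r_{12}^2}$ grows without bound as $|r_{12}| \to 1$, and it is exactly what the VIF below measures.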

The Variance Inflation Factor (VIF) tells you how much larger the variance $\sigma^2_{\hat{\beta}_i}$ is when $x_1$ and $x_2$ are correlated, compared to when they are uncorrelated. In your case, the variance is higher by a factor of four.

High VIFs are a sign of multicollinearity.
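If you want to check this outside SPSS, here is a minimal sketch of how a VIF can be computed by hand, using synthetic stand-ins for your two series (which are not shown in the question): regress each predictor on the other and apply $\mathrm{VIF}_i = 1/(1-R_i^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two predictors (n = 11, as in the question)
x1 = rng.normal(size=11)                          # personal income change
x2 = 0.85 * x1 + rng.normal(scale=0.5, size=11)   # popularity, correlated with x1

# Auxiliary regression of x1 on x2 (with intercept), to get its R^2
X = np.column_stack([np.ones(11), x2])
coef = np.linalg.lstsq(X, x1, rcond=None)[0]
resid = x1 - X @ coef
r2 = 1 - resid.var() / x1.var()

vif = 1 / (1 - r2)
print(f"auxiliary R^2: {r2:.3f}, VIF: {vif:.2f}")
```

With only two predictors both VIFs coincide and reduce to $1/(1-r_{12}^2)$, so a VIF of 4 implies $r_{12}^2 = 0.75$, i.e. a correlation of roughly $\pm 0.87$ between your predictors. (For the general case, statsmodels provides a `variance_inflation_factor` function in `statsmodels.stats.outliers_influence`.)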

EDIT: added because of the question in your comment:

If you want it in simple words, but less precisely: I think that you have some correlation between the two independent variables, personal income ($x_1$) and the president's popularity ($x_2$) (and, as you say, you also have a limited sample). Can you compute their correlation?
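In Python the check is a one-liner; the arrays below are synthetic placeholders for your 11 observed values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Substitute your own series of 11 observations here
income = rng.normal(size=11)
popularity = 0.85 * income + rng.normal(scale=0.5, size=11)

r = np.corrcoef(income, popularity)[0, 1]
print(f"correlation between income and popularity: {r:.2f}")
```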

If $x_1$ and $x_2$ are strongly correlated, then that means they 'move together'. What linear regression tries to do is 'assign' a change in the dependent variable $y$ to either $x_1$ or $x_2$. Obviously, if both 'move together' (because of the high correlation), then it is difficult to 'decide' which of the $x$'s is 'responsible' for the change in $y$ (because they both change). Therefore the estimates of the $\beta_i$ coefficients will be less precise.

A VIF of four means that the variance (a measure of imprecision) of the estimated coefficients is four times higher than it would be if the two independent variables were uncorrelated.
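A small simulation makes this visible; it uses synthetic data (not your election series) and the correlation of about 0.87 that a VIF of 4 implies: draw many samples with uncorrelated and with correlated predictors, fit the regression each time, and compare the spread of the estimated slopes.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulated_slopes(rho, n=11, reps=5000, beta=(0.0, 1.0, 1.0), sigma_eps=1.0):
    """Estimated beta_1 from `reps` simulated regressions with corr(x1, x2) = rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    slopes = np.empty(reps)
    for k in range(reps):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = (beta[0] + beta[1] * x[:, 0] + beta[2] * x[:, 1]
             + rng.normal(scale=sigma_eps, size=n))
        X = np.column_stack([np.ones(n), x])            # add intercept column
        slopes[k] = np.linalg.lstsq(X, y, rcond=None)[0][1]
    return slopes

v_uncorr = simulated_slopes(rho=0.0).var()
v_corr = simulated_slopes(rho=0.87).var()   # the correlation a VIF of 4 implies
print(f"variance ratio (correlated / uncorrelated): {v_corr / v_uncorr:.2f}")
```

The printed ratio comes out in the neighbourhood of 4, which is exactly the factor your VIF reports.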

If your goal is to predict the election results, then multicollinearity is not necessarily a problem. If, however, you want to analyse the impact of, e.g., personal income on the results, then there may be a problem, because the estimates of the coefficients are imprecise (i.e. if you estimated them with another sample, they might change a lot).
