I am doing a multiple regression, trying to test the extent to which personal income changes and Presidential popularity can predict election results. I have a small sample size, unfortunately, as the country I am studying only has data for the last 11 elections. Both of my independent variables are separately correlated with election results, with p-values < .05. However, when I fit a multiple regression with both variables, neither is significant anymore. I assumed this was due to multicollinearity, but when I asked SPSS for the VIFs they were both around 4. How should I interpret this?
Solved – How to interpret a VIF of 4
Tags: multicollinearity, multiple regression, p-value, statistical significance, variance-inflation-factor
Related Solutions
The key problem is not correlation but collinearity (see works by Belsley, for instance). This is best tested using condition indexes (available in R, SAS, and probably other programs as well). Correlation is neither a necessary nor a sufficient condition for collinearity. Condition indexes over 10 (per Belsley) indicate moderate collinearity, over 30 severe, but it also depends on which variables are involved in the collinearity.
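If you want to compute these yourself, here is a minimal sketch of Belsley-style condition indexes in R; the data and the names `income`, `approval` and `vote` are simulated stand-ins for your series, not your actual data. (If memory serves, `colldiag()` in the `perturb` package offers a fuller packaged version of the Belsley diagnostics.)

```r
## Sketch: Belsley condition indexes by hand. `income`, `approval`, `vote`
## are simulated placeholders for your 11 observations.
set.seed(1)
income   <- rnorm(11)
approval <- 0.8 * income + rnorm(11, sd = 0.5)  # deliberately correlated with income
vote     <- 50 + 2 * income + approval + rnorm(11)

X  <- model.matrix(~ income + approval)         # design matrix, intercept included
Xs <- scale(X, center = FALSE,
            scale = sqrt(colSums(X^2)))         # Belsley scaling: unit-length columns
d  <- svd(Xs)$d                                 # singular values of the scaled matrix
max(d) / d                                      # condition indexes: >10 moderate, >30 severe
```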
If you do find high collinearity, it means that your parameter estimates are unstable. That is, small changes (sometimes in the 4th significant figure) in your data can cause big changes in your parameter estimates (sometimes even reversing their sign). This is a bad thing.
Remedies are:
- Getting more data
- Dropping one variable
- Combining the variables (e.g. with partial least squares)
- Performing ridge regression, which gives biased estimates but reduces their variance (see the sketch below).
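For the ridge option, here is a minimal sketch with `MASS::lm.ridge`, reusing the simulated `vote`, `income` and `approval` from the condition-index example above; the lambda grid is arbitrary.

```r
library(MASS)

## Fit a path of ridge regressions and pick lambda by generalized cross-validation.
fits <- lm.ridge(vote ~ income + approval, lambda = seq(0, 10, by = 0.1))
select(fits)                                  # prints HKB, LW and GCV choices of lambda
round(coef(fits)[which.min(fits$GCV), ], 3)   # shrunken coefficients at the GCV choice
```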
You mentioned that regression is new to you, so I'm going to include some detail.
So you've got $[y_1, y_2, \ldots, y_{11}]$, the percentage of the vote received by the incumbent party in elections $1, 2, \ldots, 11$, as the dependent variable. Your working hypothesis is that this dependent variable depends on $P$, personal income change, and $A$, approval, of which you have eleven measurements each. When doing linear regression, the working hypothesis is that $$Y_i = \beta_0 + \beta_1 P_i + \beta_2 A_i + \epsilon_i \iff Y = X\beta + \epsilon,$$ in which $\epsilon_i \sim N(0,\sigma^2)$ and the second formula is the first in matrix notation (i.e. $X$ has a first column of ones, a second column $P$ and a third column $A$, and $\beta$ is the vector $\beta = [\beta_0\ \beta_1\ \beta_2]^T$). Let's talk about this model assumption: do you have any evidence, or even a hunch, that an increase in $P$ or an increase in $A$ will result in an increase in $Y$? If you don't, then regression is returning a line of best fit which isn't all that informative. If you do believe these quantities to be linearly related, or are testing that hypothesis (i.e. $A$ increases and $P$ increases $\implies$ $Y$ increases), then you're after the $\beta$'s, which tell you what combination of $A$ and $P$ makes $Y$ increase (how much to weight each variable, basically).
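To make this concrete, here is how the two-predictor fit looks in R; the data are simulated stand-ins (swap in your actual series for `income`, `approval` and `vote`), and `car::vif()` assumes you have the car package installed.

```r
set.seed(1)
income   <- rnorm(11)                            # placeholder for personal income change (P)
approval <- 0.8 * income + rnorm(11, sd = 0.5)   # placeholder for approval (A)
vote     <- 50 + 2 * income + approval + rnorm(11)

fit_both <- lm(vote ~ income + approval)
summary(fit_both)     # per-coefficient t-tests: the p-values that 'vanished' in your fit
car::vif(fit_both)    # variance inflation factors, as SPSS reported
```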
You think that the predictors are correlated, which is possible. Here is an easy walkthrough of how to calculate $r_{PA}$, the correlation between variables $P$ and $A$. If it turns out that they are strongly correlated, then the columns of the $X$ matrix above are close to linearly dependent, which is multicollinearity. Mathematically, the estimate for $\beta$ is $$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y,$$ so if you have multicollinearity, $(X^{T}X)$ is close to singular (not invertible), and your estimate of the weights $\beta$ becomes numerically unstable; in the extreme case of exactly dependent columns it does not exist at all. That's bad. There are diagnostics for this, such as variance inflation factors, but given that you only have 11 data points, the best solution is simply to choose the one predictor about whose relationship to $Y$ you want to make claims. Then run the regression with just that predictor, so your model looks like $$Y_i = \beta_0 + \beta_1 X_{predictor,i} + \epsilon_i$$ (in which $\epsilon_i \sim N(0, \sigma^2)$ and $predictor = A$ or $P$). Now the displayed $p$-value tests the hypothesis $\beta_1 = 0$, i.e. that the chosen variable has no effect on $Y$. Loosely put, a $p$-value of 0.05 says that, if $\beta_1$ were really $0$, data like these 11 points would produce an estimate this far from zero only 5% of the time.
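And the single-predictor fallback, continuing from the same simulated data:

```r
cor(income, approval)      # r_PA, the predictor correlation discussed above
fit_income <- lm(vote ~ income)
summary(fit_income)        # the displayed p-value now tests H0: beta_1 = 0 for income alone
```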
Best Answer
When you estimate a regression equation $y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$, where in your case $y$ is the election result, $x_1$ is personal income change and $x_2$ is presidential popularity, then, when the 'usual' assumptions are fulfilled, the estimated coefficients $\hat{\beta}_i$ are random variables (i.e. with another sample you would get other estimates) that have a normal distribution with mean equal to the 'true' but unknown $\beta_i$ and a standard deviation that can be computed from the sample, i.e. $\hat{\beta}_i \sim N(\beta_i;\sigma_{\hat{\beta}_i})$. (I am assuming here that the standard deviation of the error term $\epsilon$ is known; the reasoning does not change when it is unknown, but then the normal distribution no longer applies and one should use the t-distribution.)
If one wants to test whether a coefficient $\beta_i$ is significant, then one performs the statistical hypothesis test $H_0: \beta_i=0$ versus $H_1: \beta_i \ne 0$.
If $H_0$ is true, then the estimator $\hat{\beta}_i$ follows (see above) a normal distribution with mean 0 and the standard deviation given above, i.e. if $H_0$ is true then $\hat{\beta}_i \sim N(0;\sigma_{\hat{\beta}_i})$.
The value for $\hat{\beta}_i$ that we compute from our sample comes from this distribution, therefore $\frac{|\hat{\beta}_i - 0|}{\sigma_{\hat{\beta}_i}}$ is an outcome of (the absolute value of) a standard normal random variable. So for a significance level $\alpha$ we reject $H_0$ whenever $\frac{|\hat{\beta}_i|}{\sigma_{\hat{\beta}_i}} \ge z_{\frac{\alpha}{2}}$.
If there is correlation between your independent variables $x_1$ and $x_2$ then it can be shown that $\sigma_{\hat{\beta}_i}$ will be larger than when $x_1$ and $x_2$ are uncorrelated. Therefore, if $x_1$ and $x_2$ are correlated the null hypothesis will be 'more difficult to reject' because of the higher denominator.
The Variance Inflation Factor (VIF) tells you how much higher the variance $\sigma^2_{\hat{\beta}_i}$ is when $x_1$ and $x_2$ are correlated, compared to when they are uncorrelated. In your case the variance is higher by a factor of four, so the standard deviation $\sigma_{\hat{\beta}_i}$ is twice as large.
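To make that concrete: with exactly two predictors, both VIFs reduce to the same function of the correlation $r_{12}$ between $x_1$ and $x_2$, so a VIF of 4 pins the predictor correlation down:

$$\mathrm{VIF}_1 = \mathrm{VIF}_2 = \frac{1}{1-r_{12}^2}, \qquad 4 = \frac{1}{1-r_{12}^2} \;\Rightarrow\; r_{12}^2 = 0.75 \;\Rightarrow\; |r_{12}| \approx 0.87.$$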
High VIFs are a sign of multicollinearity.
EDIT: added because of the question in your comment:
If you want it in simple words, and less precisely: I think that you have some correlation between the two independent variables, personal income ($x_1$) and presidential popularity ($x_2$) (and, as you say, a limited sample). Can you compute their correlation?
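In R that is a one-liner; the vectors below are simulated stand-ins for your 11 observations of $x_1$ and $x_2$:

```r
set.seed(1)
income   <- rnorm(11)                            # stand-in for x1, personal income change
approval <- 0.8 * income + rnorm(11, sd = 0.5)   # stand-in for x2, popularity

cor(income, approval)        # Pearson correlation between the predictors
cor.test(income, approval)   # the same, with a test and confidence interval
```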
If $x_1$ and $x_2$ are strongly correlated then that means that they 'move together'. What linear regression tries to do is to 'assign' a change in the dependent variable $y$ to either $x_1$ or $x_2$. Obviously, if both 'move together' (because of high correlation) then it will be difficult to 'decide' which of the $x$'s is 'responsible' for the change in $y$ (because they both change). Therefore the estimates of the $\beta_i$ coefficients will be less precise.
A VIF of four means that the variance (a measure of imprecision) of the estimated coefficients is four times higher than it would be without the correlation between the two independent variables.
If your goal is to predict the election results, then multicollinearity is not necessarily a problem. If you want to analyse the impact of, e.g., personal income on the results, then there may be a problem, because the estimates of the coefficients are imprecise (i.e. if you estimated them with another sample they might change a lot).
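A small sketch of that distinction, reusing the simulated `income` and `approval` from above: the joint fit can still produce sensible predictions even when the individual coefficients are too imprecise to interpret.

```r
vote <- 50 + 2 * income + approval + rnorm(11)   # simulated outcome
fit  <- lm(vote ~ income + approval)

## Prediction from the multicollinear model is still well-defined and usable.
predict(fit,
        newdata  = data.frame(income = 1, approval = 0.8),
        interval = "prediction")                 # point forecast plus prediction interval
```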