Solved – p-values increase when additional significant variables are added (multicollinearity?)

multicollinearity, multiple-regression, p-value, small-sample, statistical-significance

I am doing a study for my master's, correlating two separate development indicators with election results for the incumbent government. Unfortunately, I was only able to get 11 years' worth of data for these indicators, due to financial restrictions and what is publicly available from the government.

- Running a simple linear regression with the first indicator as the independent variable and election results as the dependent, Indicator 1 was significant, with a p-value of .02.

- Doing the same with Indicator 2, it was also significant, with a p-value of .04.

- However, when I run a multiple regression with both variables, neither is anywhere close to significant, with both p-values above .5.

I turned to the internet for help, and came up with two theories:
1) I need a larger sample size to properly run a multiple regression
2) Multicollinearity (something I was not familiar with) is the issue

I did a regression with Indicator 1 as the independent variable and Indicator 2 as the dependent, and discovered that the two are highly correlated with each other. This leads me to believe multicollinearity is my problem.

1) Am I right, is it multicollinearity?

2) If so, what (if anything) can I do about it?

3) If not, then what else could be the problem?

I studied regression this year for my master's, but had little statistical knowledge beforehand, so this is somewhat new to me. Any help would be appreciated.

Best Answer

You mentioned that regression is new to you, so I'm going to include some detail.


So you've got $y_1, y_2, \ldots, y_{11}$, which represent the percentage of the vote received by the incumbent party in years $1, 2, \ldots, 11$; this is your dependent variable. Your working hypothesis is that this dependent variable depends on $P$, personal income change, and $A$, approval, of which you have eleven measurements each. When doing linear regression, the working model is $$Y_i = \beta_0 + \beta_1 P_i + \beta_2 A_i + \epsilon_i \quad\text{or, in matrix notation,}\quad Y = X\beta + \epsilon,$$ in which $\epsilon_i \sim N(0,\sigma^2)$ and the two forms are equivalent ($X$ has a first column of ones, a second column holding $P$, and a third column holding $A$, and $\beta$ is the vector $\beta = [\beta_0\ \beta_1\ \beta_2]^{T}$).

Let's talk about this model assumption: do you have any evidence, or even a hunch, that an increase in $P$ and an increase in $A$ will result in an increase in $Y$? If you don't, then regression is returning a line of best fit which isn't all that informative. If you do believe these quantities to be linearly related, or are testing that hypothesis (i.e. $A$ increases and $P$ increases $\implies$ $Y$ increases), then you're after the $\beta$'s, which tell you what combination of $A$ and $P$ makes $Y$ increase (how much to weight each variable, basically).
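If it helps to see the setup concretely, here is a minimal sketch in Python with statsmodels (not part of your original analysis; the arrays `y`, `P`, and `A` are synthetic placeholders standing in for your eleven measurements):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-ins for the 11 yearly observations (replace with your data).
rng = np.random.default_rng(0)
P = rng.normal(size=11)                        # personal income change
A = 0.9 * P + rng.normal(scale=0.3, size=11)   # approval, deliberately correlated with P
y = 50 + 2 * P + 1 * A + rng.normal(size=11)   # incumbent vote share

# Design matrix X: a column of ones, then P, then A (matching beta = [b0, b1, b2]).
X = sm.add_constant(np.column_stack([P, A]))

# Fit Y = X beta + eps by ordinary least squares and inspect coefficients and p-values.
model = sm.OLS(y, X).fit()
print(model.summary())
```

With two predictors this correlated, the summary will typically show large standard errors on both slopes, which is the pattern you described.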

You think that the predictors are correlated, which is possible. Here is an easy walkthrough of how to calculate $r_{PA}$, the correlation between variables $P$ and $A$. If it turns out that they are strongly correlated, then $X^{T}X$ can't easily be inverted, due to multicollinearity, which is a lack of (linear) independence among the columns of $X$. Mathematically, the least-squares estimate for $\beta$ is $$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y.$$ So if you have multicollinearity, $(X^{T}X)$ is close to being non-invertible (which happens when the columns of $X$ are nearly linearly dependent), meaning that your estimate of the weights $\beta$ becomes extremely unstable: the standard errors blow up, and with exact collinearity the estimate doesn't exist at all. That's bad.

There are ways to diagnose this (e.g. variance inflation factors), but given that you only have 11 data points, the best solution is simply to choose the one predictor about whose relationship to $Y$ you want to make claims. Then run the regression with just that predictor, so your model looks like $$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$ (in which $\epsilon_i \sim N(0, \sigma^2)$ and $X$ is either $A$ or $P$). Now, the $p$-value that is displayed tests the hypothesis that $\beta_1 = 0$, i.e. that the chosen variable has no (linear) effect on $Y$. Loosely put, a $p$-value of 0.05 means that if $\beta_1$ really were zero, then data showing an association at least this strong would arise only about 5% of the time.
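As a concrete illustration (again a sketch, not something from your study; Python with numpy/statsmodels, and the same synthetic placeholder arrays `P`, `A`, `y` as above), this is roughly how you could compute $r_{PA}$, check the variance inflation factors, and compare the two single-predictor fits:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-ins for the 11 yearly observations (replace with your data).
rng = np.random.default_rng(1)
P = rng.normal(size=11)                        # personal income change
A = 0.9 * P + rng.normal(scale=0.3, size=11)   # approval, strongly correlated with P
y = 50 + 2 * P + 1 * A + rng.normal(size=11)   # incumbent vote share

# Correlation between the two predictors, r_PA.
r_PA = np.corrcoef(P, A)[0, 1]
print(f"r_PA = {r_PA:.3f}")

# Variance inflation factors for the design matrix [1, P, A]:
# values well above roughly 5-10 signal troublesome multicollinearity.
X_full = sm.add_constant(np.column_stack([P, A]))
for idx, name in [(1, "P"), (2, "A")]:
    print(f"VIF({name}) = {variance_inflation_factor(X_full, idx):.2f}")

# Single-predictor models: Y = b0 + b1 * predictor + eps.
for name, x in [("P", P), ("A", A)]:
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print(f"{name}: slope = {fit.params[1]:.3f}, p-value = {fit.pvalues[1]:.4f}")
```

If $r_{PA}$ comes out close to 1 and the VIFs are large, that supports the multicollinearity explanation, and reporting one predictor at a time is the sensible way forward with only eleven observations.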