This is my first question here, so I'm sorry if the information seems messy.
I have a dataset with 4 variables and 25 observations. I want to do a regression where Var1 is explained by Var2, Var3 and Var4.
Var2 and Var3 have 0.97 correlation
Var2 and Var4 have 0.48 correlation
Var3 and Var4 have 0.50 correlation
Here I have an issue with multicollinearity between Var2 and Var3. I'm considering using PCA to remove this issue, but I'm not sure whether that makes sense with so few variables.
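One common way to quantify this kind of multicollinearity is through variance inflation factors (VIFs), which can be read off the diagonal of the inverse correlation matrix of the predictors. A minimal sketch with numpy, using the pairwise correlations listed above (the full 3×3 matrix is an assumption reconstructed from those three numbers):

```python
import numpy as np

# Predictor correlation matrix assembled from the pairwise correlations
# given in the question (Var2, Var3, Var4 in that order).
R = np.array([
    [1.00, 0.97, 0.48],  # Var2
    [0.97, 1.00, 0.50],  # Var3
    [0.48, 0.50, 1.00],  # Var4
])

# The VIF of each predictor equals the corresponding diagonal entry of
# the inverse correlation matrix: VIF_j = 1 / (1 - R_j^2), where R_j^2
# comes from regressing predictor j on the other predictors.
vif = np.diag(np.linalg.inv(R))
print(dict(zip(["Var2", "Var3", "Var4"], vif.round(2))))
```

With these correlations, Var2 and Var3 both come out well above the common rule-of-thumb cutoff of 10, while Var4 stays close to 1, which matches the intuition that the collinearity problem is confined to the Var2/Var3 pair.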
It should be mentioned that in this regression, Var4 has a very high p-value.
Should I use PCA on all three variables, or just on Var2 and Var3, keeping Var4 out of it?
Could another solution be to drop one of the correlated variables (Var2 or Var3)?
This may be unrelated to the topic, but when Var4 has such a high p-value and makes minimal difference to R-squared, should it be omitted anyway?
Thanks in advance.
Best Answer
Sounds like you've got this problem: \begin{eqnarray} \mathrm{var_1} &=& y\\ \langle \mathrm{var_2,var_3,var_4} \rangle &=& \vec{x} \end{eqnarray}
Your data matrix isn't very large, just 25 x 3:
\begin{eqnarray} X &=& \left( \begin{array}{c} \vec{x}_1 \\ \vec{x}_2 \\ \vdots \\ \vec{x}_{25}\end{array}\right) \end{eqnarray}
with a predictor correlation matrix something like this:
\begin{eqnarray} \rho_{ij} = \Sigma_{ij}/\sqrt{\Sigma_{ii}\Sigma_{jj}} &=& \left( \begin{array}{ccc} 1 & .97 & .48 \\ .97 & 1 & .5 \\ .48 & .5 & 1\end{array}\right) \end{eqnarray}
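To see what PCA would buy you here, note that PCA on standardized predictors is just the eigendecomposition of this correlation matrix. A sketch assuming numpy (the matrix values are the ones quoted above):

```python
import numpy as np

# The predictor correlation matrix from the question (Var2, Var3, Var4).
R = np.array([
    [1.00, 0.97, 0.48],
    [0.97, 1.00, 0.50],
    [0.48, 0.50, 1.00],
])

# PCA on standardized variables = eigendecomposition of the correlation
# matrix; the eigenvalues sum to the number of variables (here 3), and
# each eigenvalue's share is that component's explained variance.
eigvals = np.linalg.eigvalsh(R)[::-1]   # sorted largest first
explained = eigvals / eigvals.sum()
print(explained.round(3))
```

With a 0.97 correlation between Var2 and Var3, the first component absorbs the bulk of the variance and the last eigenvalue is nearly zero, which is exactly the near-singularity that destabilizes the regression coefficients.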
If the first two of your predictors -- $var_2$ and $var_3$ -- are that highly correlated, you've got a couple of choices: