Solved – PCA on a few variables with high correlation

Tags: correlation, pca

This is my first question here, so I'm sorry if the information seems messy.

I have a dataset with 4 variables and 25 observations. I want to do a regression where Var1 is explained by Var2, Var3 and Var4.

Var2 and Var3 have a correlation of 0.97

Var2 and Var4 have a correlation of 0.48

Var3 and Var4 have a correlation of 0.50

This gives me a multicollinearity issue between Var2 and Var3. I'm considering using PCA to address it, but I'm not sure whether that makes sense with so few variables.
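In case it helps, this is roughly how I quantified the collinearity; a minimal sketch, assuming my data live in a pandas DataFrame `df` with the column names above:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# df is hypothetical: a DataFrame with columns Var1..Var4.
X = sm.add_constant(df[["Var2", "Var3", "Var4"]])

# A variance inflation factor above ~10 is a common red flag.
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```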

It should be mentioned that in this regression, Var4 has a very high p-value.

Should I use PCA on all three variables, or just on Var2 and Var3, keeping Var4 outside?

Could another solution be to drop one of the correlated variables (Var2 or Var3)?

This may be somewhat off-topic, but when Var4 has such a high p-value and makes minimal difference to the R-squared, should it be omitted anyway?

Thanks in advance.

Best Answer

Sounds like you've got this problem: \begin{eqnarray} \mathrm{var_1} &=& y\\ \langle \mathrm{var_2,var_3,var_4} \rangle &=& \vec{x} \end{eqnarray}

Your data matrix isn't very large, just 25 × 3:

\begin{eqnarray} X &=& \left( \begin{array}{c} \vec{x}_1 \\ \vec{x}_2 \\ \vdots \\ \vec{x}_{25}\end{array}\right) \end{eqnarray}

with a correlation matrix something like this:

\begin{eqnarray} \rho_{ij} = \Sigma_{ij}/\sqrt{\Sigma_{ii}\Sigma_{jj}} &=& \left( \begin{array}{ccc} 1 & .97 & .48 \\ .97 & 1 & .5 \\ .48 & .5 & 1\end{array}\right) \end{eqnarray}

If two components of your predictor vector -- $var_2$ and $var_3$ -- are that strongly correlated, you've got a couple of choices:

  • Regularize the regression with an L2 penalty (ridge regression) or an L1 penalty (the LASSO, which, if the penalty is strong enough, will probably just choose between $var_2$ and $var_3$). See the first sketch below.
  • Do PCA, as you said, and regress on the resulting components (three or fewer), but you'll probably get similar results to the option above with a lot more complexity: the features -- and hence the regression coefficients $\vec{\beta}$ -- will no longer be interpretable. See the second sketch below.
  • Just drop $var_2$ or $var_3$, based on which one has less incremental benefit -- reduction in $\chi^2$ or RSS -- after fitting a model with only $var_4$, then ($var_2$, $var_4$) vs. ($var_3$, $var_4$). See the third sketch below.
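A minimal sketch of the first option with scikit-learn, assuming `X` is your 25 × 3 predictor matrix and `y` is $var_1$ (both hypothetical NumPy arrays):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler

# Penalized regression is scale-sensitive, so standardize first.
X_std = StandardScaler().fit_transform(X)

# L2 penalty (ridge): shrinks the correlated coefficients toward each other.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X_std, y)

# L1 penalty (LASSO): tends to zero out one of var_2 / var_3.
lasso = LassoCV(cv=5).fit(X_std, y)

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)
```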
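And a sketch of the PCA route under the same assumptions; `n_components=0.95` keeps as many components as are needed to explain 95% of the predictor variance, which with a 0.97 correlation between $var_2$ and $var_3$ will likely be two:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize, project onto principal components, then regress on them.
pcr = make_pipeline(StandardScaler(),
                    PCA(n_components=0.95),
                    LinearRegression())
pcr.fit(X, y)
print(pcr.named_steps["pca"].explained_variance_ratio_)
```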
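Finally, the drop-a-variable comparison is easy to run with statsmodels formulas, again assuming the data sit in a hypothetical DataFrame `df` with the column names from the question:

```python
import statsmodels.formula.api as smf

# Fit the two reduced models and compare residual sum of squares.
fit_24 = smf.ols("Var1 ~ Var2 + Var4", data=df).fit()
fit_34 = smf.ols("Var1 ~ Var3 + Var4", data=df).fit()

print("Var2 + Var4: RSS =", fit_24.ssr, " adj. R^2 =", fit_24.rsquared_adj)
print("Var3 + Var4: RSS =", fit_34.ssr, " adj. R^2 =", fit_34.rsquared_adj)
```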