Machine Learning – Why Use Dimensionality Reduction Despite Reduced Explained Variation?

machine learning, multicollinearity

Let's say I have $N$ covariates in my regression model, and they explain 95% of the variation in the target, i.e. $r^2 = 0.95$. Suppose there is multicollinearity among these covariates, so PCA is performed to reduce the dimensionality. If the retained principal components explain, say, 80% of the variation (as opposed to 95%), then I have incurred some loss in the accuracy of my model.

Effectively, if PCA solves the issue of multicollinearity at the cost of accuracy, is there any benefit to it, other than the fact that it can speed up model training and reduce collinear covariates into uncorrelated, more robust variables?
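
For illustration only, here is a minimal sketch of the behaviour described above (synthetic data and scikit-learn are assumptions on my part, not part of the question): strongly collinear covariates are rotated into uncorrelated principal components, and keeping fewer components discards some of the covariate variance.

```python
# Sketch: PCA turns collinear covariates into uncorrelated components,
# at the cost of some retained variance. Synthetic data, for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 2))
# Five covariates built from two latent factors -> strong collinearity
X = np.column_stack([
    z[:, 0],
    z[:, 0] + 0.05 * rng.normal(size=n),
    z[:, 1],
    z[:, 1] + 0.05 * rng.normal(size=n),
    z[:, 0] - z[:, 1],
])

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)

print(np.corrcoef(X, rowvar=False).round(2))        # large off-diagonal correlations
print(np.corrcoef(scores, rowvar=False).round(2))   # components are uncorrelated
print(pca.explained_variance_ratio_.sum())          # variance retained by 2 components
```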

Best Answer

Your question implicitly assumes that reducing explained variation is necessarily a bad thing. Recall that $R^2$ is defined as
$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, $$
where $SS_{res} = \sum_{i}{(y_i - \hat{y}_i)^2}$ is the residual sum of squares and $SS_{tot} = \sum_{i}{(y_i - \bar{y})^2}$ is the total sum of squares. You can easily obtain $R^2 = 1$ (i.e. $SS_{res} = 0$) by fitting a curve that passes through all of the (training) points (though this generally requires a more flexible model than simple linear regression, as noted by Eric), which is a textbook example of overfitting. So reducing explained variation on the training data is not necessarily bad: it can result in better performance on unseen (test) data. PCA can be a good preprocessing technique if there are reasons to believe that the dataset has an intrinsic lower-dimensional structure.
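
To make the point concrete, here is a hedged sketch (synthetic data with an assumed low-dimensional structure, scikit-learn; none of this comes from the original answer): ordinary least squares on all covariates achieves a higher training $R^2$, while regression on a few principal components explains less of the training variation but can generalize comparably or better on held-out data.

```python
# Sketch: training R^2 vs. test R^2 for full OLS and PCA-then-regression,
# on synthetic data generated from a true low-dimensional structure.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n, p = 200, 30
latent = rng.normal(size=(n, 3))                      # intrinsic 3-dimensional structure
X = latent @ rng.normal(size=(3, p)) + 0.1 * rng.normal(size=(n, p))
y = latent @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)                                  # all 30 covariates
pcr = make_pipeline(PCA(n_components=3), LinearRegression()).fit(X_tr, y_tr)  # 3 components

print("OLS train/test R^2:", ols.score(X_tr, y_tr), ols.score(X_te, y_te))
print("PCR train/test R^2:", pcr.score(X_tr, y_tr), pcr.score(X_te, y_te))
```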
