Solved – How to apply regression on principal components to predict an output variable

pca, regression

I read about the basics of principal component analysis from tutorial1, link1 and link2.

I have a data set of 100 variables (including the output variable Y). I want to reduce the variables to 40 by PCA and then predict variable Y using those 40 variables.

Problem 1: After getting the principal components and choosing the first 40, if I apply regression to them I get some function that fits the data. But how do I then predict variable Y from the original data? To predict Y I have 99 (100-1) variables as input, so how do I know which 40 to choose out of those 99?

Problem 2: I can reverse the PCA and reconstruct the data from those 40 principal components, but the reconstructed data differ from the original because I kept only the first 40 components. Does applying regression to these reconstructed data make any sense?

I use Matlab/Octave.

Best Answer

You don't choose a subset of your original 99 (100-1) variables.

Each principal component is a linear combination of all 99 predictor variables (x-variables, IVs, ...). If you use the first 40 principal components, each of them is a function of all 99 original predictor variables. (At least with ordinary PCA - there are sparse/regularized versions, such as the SPCA of Zou, Hastie and Tibshirani, that will yield components based on fewer variables.)

Consider the simple case of two positively correlated variables, which for simplicity we will assume are equally variable. Then the first principal component will be a (fractional) multiple of the sum of both variates and the second will be a (fractional) multiple of the difference of the two variates; if the two are not equally variable, the first principal component will weight the more-variable one more heavily, but it will still involve both.
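
A quick Octave/MATLAB sketch of that two-variable case (simulated data; the variable names are mine, purely illustrative):

```matlab
% Two positively correlated variables with equal variances
n  = 100000;
u  = randn(n, 1);  v = randn(n, 1);
x1 = u + 0.5*v;                       % var(x1) = 1.25
x2 = u - 0.5*v;                       % var(x2) = 1.25, cov(x1,x2) = 0.75
X  = [x1, x2];

[W, D]   = eig(cov(X));               % eigenvectors of the covariance matrix
[~, idx] = sort(diag(D), 'descend');  % order components by variance explained
W = W(:, idx);

disp(W(:, 1)')   % ~ [0.71  0.71] (up to sign): a multiple of x1 + x2
disp(W(:, 2)')   % ~ [0.71 -0.71] (up to sign): a multiple of x1 - x2
```

The sign of an eigenvector is arbitrary, so the weights may come out negated; the direction is what matters.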

So you start with your 99 x-variables, from which you compute your 40 principal components by applying the corresponding weights to each of the original variables. [NB in my discussion I assume $y$ and the $X$'s are already centered.]

You then use your 40 new variables as if they were predictors in their own right, just as you would with any multiple regression problem. (In practice, there are more efficient ways of getting the estimates, but let's leave the computational aspects aside and just deal with the basic idea.)
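
Here is a minimal Octave/MATLAB sketch of those two steps on simulated stand-in data (the names n, p, k, Xc, Z and beta_PC are just illustrative):

```matlab
% Principal component regression, minimal sketch
n = 500;  p = 99;  k = 40;            % observations, predictors, components kept
X = randn(n, p);                      % stand-in predictor matrix
y = X*randn(p, 1) + randn(n, 1);      % stand-in response

Xc = X - mean(X);                     % center predictors (implicit expansion;
yc = y - mean(y);                     % use bsxfun on very old MATLAB)

[~, ~, W] = svd(Xc, 'econ');          % columns of W are the PC weight vectors
Z = Xc * W(:, 1:k);                   % scores on the first k components

beta_PC = Z \ yc;                     % ordinary least squares of y on Z
yhat    = Z * beta_PC;                % fitted values
```

Taking the right singular vectors of the centered $X$ is just one of several equivalent ways to get the weight matrix $W$; princomp/pca from the statistics packages would do the same job if you have them.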

In respect of your second question, it's not clear what you mean by "reversing of the PCA".

Your PCs are linear combinations of the original variates. Let's say your original variates are in $X$, and you compute $Z=XW$ (where $X$ is $n\times 99$ and $W$ is the $99\times 40$ matrix which contains the principal component weights for the $40$ components you're using), then you estimate $\hat{y}=Z\hat{\beta}_\text{PC}$ via regression.

Then you can write $\hat{y}=Z\hat{\beta}_\text{PC}=XW\hat{\beta}_\text{PC}=X\hat{\beta}^*$, say (where $\hat{\beta}^*=W\hat{\beta}_\text{PC}$, obviously), so you can write it as a function of the original predictors. I don't know if that's what you meant by 'reversing', but it's a meaningful way to look at the original relationship between $y$ and $X$. It's not the same as the coefficients you get by estimating a regression on the original X's, of course -- it's regularized by doing the PCA; even though you get coefficients for each of your original X's this way, they only have the degrees of freedom of the number of components you fitted.
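
Continuing the sketch above (so it assumes the W, k, beta_PC, Xc and yhat defined there), folding the weights into the coefficients is one line, and you can verify the predictions are unchanged:

```matlab
beta_star = W(:, 1:k) * beta_PC;      % p x 1 coefficients on the original X's
yhat2     = Xc * beta_star;           % X*beta_star reproduces Z*beta_PC
disp(max(abs(yhat - yhat2)))          % ~ 0, up to floating-point error
```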

Also see Wikipedia on principal component regression.
