How to interpret PCA coefficients to reduce dimension

dimensionality reduction, MATLAB, pca

I have read similar questions. I have data with 68 columns and about 800 samples. The 68th column is the output; the remaining 67 are the input variables. I want to reduce the number of input variables to, for example, 30 or 20.

I have read about PCA. I already ran PCA in Matlab and obtained a 67 x 20 matrix containing the PCA coefficients. I calculated the eigenvalues for each principal component (10 eigenvalues). As far as I understand, I should order these eigenvalues and treat the PCs with the highest eigenvalues as the important ones. For example, I chose PC1, PC3, and PC9.

How can I use this information to select among the original 67 variables? That is, how can I use these PCA results to reduce a 67*800 matrix to a 20*800 matrix and keep the variables that have the greatest effect on the target variable?

Best Answer

The eigenvalues that you get in Matlab with pca() are already sorted in decreasing order, so the first n principal components are guaranteed to be the n that explain the most variance. You can check this yourself by looking at the latent output, which is the vector of eigenvalues. If you would then like to reduce your variable set to 20 variables, you can simply retain only the first 20 PCs (the first 20 columns) of the coeff matrix.
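To see the ordering of the eigenvalues concretely, here is a minimal sketch. It assumes the Statistics and Machine Learning Toolbox (which provides pca()) and uses random numbers as a hypothetical stand-in for the 800 x 68 data set in the question; the variable names data, X, y, k, isSorted and cumExplained are illustrative only.

```matlab
% Hypothetical stand-in for the asker's data: 800 samples, 67 inputs + 1 output.
rng(0);                          % make the random data reproducible
data = randn(800, 68);
X = data(:, 1:67);               % input variables
y = data(:, 68);                 % output variable (PCA does not use it)

% coeff:     67 x 67 matrix, one principal component per column
% latent:    eigenvalues of the covariance matrix of X
% explained: percentage of total variance explained by each PC
[coeff, score, latent, ~, explained] = pca(X);

% The eigenvalues come back sorted from largest to smallest.
isSorted = all(diff(latent) <= 0);

% Cumulative percentage of variance captured by the first k PCs.
cumExplained = cumsum(explained);
k = 20;
fprintf('First %d PCs explain %.1f%% of the variance\n', k, cumExplained(k));
```

Looking at cumExplained is one common way to decide how many PCs to keep, instead of fixing the number at 20 in advance.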

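You then project your original data set $X$ (which is 800 $\times$ 67) onto the reduced PC basis. Because pca() centers each column by default, subtract the column means first so that the result matches the scores that pca() returns:

Z = (X - mean(X)) * coeff(:,1:20)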

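to get the PC scores matrix $Z$, which represents your original data in a 20-dimensional space instead of the 67-dimensional one. The matrix $Z$ has size 800 $\times$ 20. Note that each column of $Z$ is a linear combination of all 67 original variables, not a subset of them.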
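As a quick check of the projection, the sketch below computes $Z$ by hand and compares it with the score output of pca(). It again uses random data as a hypothetical stand-in for $X$, assumes the Statistics and Machine Learning Toolbox, and relies on implicit expansion (R2016b or later) for the mean subtraction. Column means are subtracted because pca() centers the data by default. The last lines list the original variables with the largest absolute coefficients on PC1; this shows which variables contribute most to that component, which is not the same as their effect on the target.

```matlab
rng(0);                               % reproducible stand-in data
X = randn(800, 67);                   % hypothetical 800 x 67 input matrix

% mu is the 1 x 67 vector of column means that pca() subtracts by default.
[coeff, score, ~, ~, ~, mu] = pca(X);

k = 20;                               % number of PCs to keep
Z = (X - mu) * coeff(:, 1:k);         % 800 x 20 matrix of PC scores

% Z should agree with the first k columns of score up to rounding error.
D = Z - score(:, 1:k);
maxDiff = max(abs(D(:)));
fprintf('Size of Z: %d x %d, max difference from score: %g\n', ...
        size(Z, 1), size(Z, 2), maxDiff);

% Original variables with the largest absolute coefficients on PC1.
[~, order] = sort(abs(coeff(:, 1)), 'descend');
topVars = order(1:5);
```

In practice you can take score(:, 1:k) directly; the manual projection is mainly useful for applying the same transformation to new samples, using the mu and coeff obtained from the original data.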