Solved – How to interpret PCA coefficients to reduce dimension

dimensionality reduction, MATLAB, pca

I have read similar questions. My data has 68 columns and about 800 samples. The 68th column is the output; the other 67 are the input variables. I want to reduce the number of input variables to, for example, 30 or 20.

I have read about PCA. I already ran PCA in Matlab and obtained a 67 x 20 matrix of PCA coefficients. I calculated the eigenvalue of each principal component (10 eigenvalues). As far as I understand, I should order these eigenvalues and treat the PCs with the larger eigenvalues as the important ones. For example, I chose PC1, PC3, and PC9.

How can I use this information to select among the original 67 variables? That is, how can I use the PCA results to reduce a 67*800 matrix to a 20*800 matrix and keep the variables that have the greatest effect on the target variable?

Best Answer

The eigenvalues returned by pca() in Matlab are already in decreasing order, so the first n principal components are guaranteed to be the n that explain the most variance. You can check this yourself by looking at latent, the vector of eigenvalues that pca() returns. If you would like to reduce your variable set to 20 variables, simply retain the first 20 PCs, i.e. the first 20 columns of the coeff matrix.
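As a minimal sketch of how one might check the ordering and decide how many components to keep, the snippet below runs pca() on synthetic data. The random matrix X is only a stand-in for the 800-by-67 input data, and the 95% threshold is an arbitrary illustrative choice, not something from the question.

```matlab
% Synthetic stand-in for the 800-by-67 input matrix (an assumption, not the asker's data).
rng(0);                              % fix the seed so the run is repeatable
X = randn(800, 67) * randn(67, 67);  % mix the columns so they are correlated

% pca() belongs to the Statistics and Machine Learning Toolbox.
% latent: eigenvalues of cov(X); explained: percent of total variance per component.
[coeff, score, latent, ~, explained] = pca(X);

% The eigenvalues come back sorted in decreasing order.
isDecreasing = all(diff(latent) <= 0);

% Cumulative percentage of variance captured by the first k components.
cumExplained = cumsum(explained);

% Smallest number of components reaching the (hypothetical) 95% threshold.
k = find(cumExplained >= 95, 1);
```

Looking at cumExplained(20) would show how much of the variance the first 20 components retain for a given data set.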

You then project your original data set $X$ (which is 800 $\times$ 67) onto the reduced PC basis. Because pca() centers the data by default, subtract the column means first:

Z = bsxfun(@minus, X, mean(X)) * coeff(:,1:20)

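to get the PC scores matrix $Z$, which represents your original data in a 20-dimensional space instead of a 67-dimensional one. The matrix $Z$ is of size 800 $\times$ 20. Note that each column of $Z$ is a linear combination of all 67 original variables, so this reduces the dimension but does not pick out a subset of the original variables.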
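The projection can be sketched as follows on synthetic data (again a stand-in, not the asker's matrix). Since pca() centers the data by default, the sketch subtracts the column means before multiplying, and then compares the result with the score output of pca().

```matlab
rng(0);
X = randn(800, 67) * randn(67, 67);     % synthetic 800-by-67 stand-in (assumption)

% mu is the row vector of column means that pca() subtracts internally.
[coeff, score, ~, ~, ~, mu] = pca(X);

% Center the data, then project onto the first 20 principal axes.
Xc = bsxfun(@minus, X, mu);
Z  = Xc * coeff(:, 1:20);               % 800-by-20 matrix of scores

% Z should agree with the first 20 columns of score up to rounding error.
maxDiff = max(max(abs(Z - score(:, 1:20))));

% Rank-20 approximation of X rebuilt from the retained scores.
Xhat = bsxfun(@plus, Z * coeff(:, 1:20)', mu);
```

In practice score(:, 1:20) can be used directly; the explicit multiplication is mainly useful for projecting new observations with the same mu and coeff.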

Related Solutions

Solved – PCA output of Matlab’s pca() function doesn’t match manual calculation

The problem is the zscore function. If I do a "manual" z-scoring of my matrix (dividing each column by its standard deviation), I get the same result as with pca:

M = [10,5,14;12,5,45;123,58,42];
%// "manual" zscore
stdr = std(M);
X = M./repmat(stdr,size(M,1),1);
%// "manual" PCA
V = cov(X);
[U,E] = eig(V);
%// with pca function
[coeff,score,eigenvalue] = pca(X);

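The diagonal of E matches eigenvalue and U matches coeff, up to the ordering of the eigenvalues (eig does not sort them in decreasing order the way pca does) and the sign of each eigenvector. With only three rows, pca also returns just the components with nonzero variance. So I think I understand how to calculate a PCA.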
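A hedged sketch of how the two results could be compared programmatically: sort the output of eig in decreasing order, keep as many components as pca() returns, and compare eigenvectors through their absolute values because their signs are arbitrary. The variable names evals, order, eigDiff and vecDiff are introduced here for illustration.

```matlab
M = [10,5,14; 12,5,45; 123,58,42];
X = M ./ repmat(std(M), size(M,1), 1);      % scale each column to unit standard deviation

% Manual PCA: eigendecomposition of the covariance matrix.
[U, E] = eig(cov(X));
[evals, order] = sort(diag(E), 'descend');  % reorder eigenvalues, largest first
U = U(:, order);                            % reorder eigenvectors to match

% pca() from the Statistics and Machine Learning Toolbox.
[coeff, ~, latent] = pca(X);
k = numel(latent);                          % pca() may return fewer components than columns

% Largest discrepancy between the two sets of eigenvalues.
eigDiff = max(abs(evals(1:k) - latent));

% Eigenvectors are defined only up to sign, so compare absolute values.
vecDiff = max(max(abs(abs(U(:, 1:k)) - abs(coeff))));
```

Both eigDiff and vecDiff should be at the level of rounding error if the manual calculation agrees with pca().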

Solved – Kmeans clustering results on pca dataset reduction

Components are ordered by how much variance your data show along each of them. Points at opposite ends of the first component are therefore farther apart than points at opposite ends of any other component. K-means works with distances between points. When two points lie at opposite ends of the PC1 projection, the distance between them is much larger than when they lie at opposite ends of, say, the PC8 projection. This is not a problem: it simply reflects that PC1 carries more of the spread in the data, so it has more influence on the clusters.
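The point about distances can be illustrated with a small sketch on synthetic data (an assumption, not the data from the question). It checks that the variance of each score column equals the corresponding eigenvalue, that the full set of scores preserves Euclidean distances, and then clusters on the first 8 scores. The choices of 8 components and 3 clusters are arbitrary.

```matlab
rng(0);
X = randn(800, 67) * randn(67, 67);     % synthetic stand-in data (assumption)
[~, score, latent] = pca(X);

% The variance along each component equals its eigenvalue, in decreasing order.
scoreVar = var(score)';

% PCA with all components is a rotation of the centered data,
% so Euclidean distances between points are unchanged.
distOrig = norm(X(1, :) - X(2, :));
distPC   = norm(score(1, :) - score(2, :));

% Share of the squared distance between points 1 and 2 contributed by each component.
d2    = (score(1, :) - score(2, :)).^2;
share = d2 / sum(d2);

% K-means on the first 8 scores (kmeans is in the Statistics and Machine Learning Toolbox).
idx = kmeans(score(:, 1:8), 3);
```

If equal weighting of components were wanted instead, the score columns could be rescaled before clustering, but that changes the distances K-means sees.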
