Here is some code I wrote to help myself understand the MATLAB syntax for PCA.
[coeff,score,latent,~,explained] = pca(X);   % built-in PCA
covarianceMatrix = cov(X);
[V,D] = eig(covarianceMatrix);               % PCA "by hand": eigendecomposition of the covariance
coeff
coeff =
-0.5173 0.7366 -0.1131 0.4106 0.0919
0.6256 0.1345 0.1202 0.6628 -0.3699
-0.3033 -0.6208 -0.1037 0.6252 0.3479
0.4829 0.1901 -0.5536 -0.0308 0.6506
0.1262 0.1334 0.8097 0.0179 0.5571
V
V =
0.0919 0.4106 -0.1131 -0.7366 -0.5173
-0.3699 0.6628 0.1202 -0.1345 0.6256
0.3479 0.6252 -0.1037 0.6208 -0.3033
0.6506 -0.0308 -0.5536 -0.1901 0.4829
0.5571 0.0179 0.8097 -0.1334 0.1262
dataInPrincipalComponentSpace = X*coeff   % matches score because X is zero-mean; in general, project the centered data: (X - mean(X))*coeff
dataInPrincipalComponentSpace =
-0.5295 0.0362 0.5630 0.1053 -0.0428
0.2116 0.6573 -0.1721 -0.0306 -0.1559
0.6427 -0.0017 0.2739 -0.1635 0.2203
-0.6273 0.0239 -0.3678 -0.0710 0.2214
0.1332 0.0507 -0.0708 0.2772 0.0398
0.3145 -0.4825 -0.2080 0.1496 -0.0842
-0.1451 -0.2840 -0.0182 -0.2670 -0.1987
score
score =
-0.5295 0.0362 0.5630 0.1053 -0.0428
0.2116 0.6573 -0.1721 -0.0306 -0.1559
0.6427 -0.0017 0.2739 -0.1635 0.2203
-0.6273 0.0239 -0.3678 -0.0710 0.2214
0.1332 0.0507 -0.0708 0.2772 0.0398
0.3145 -0.4825 -0.2080 0.1496 -0.0842
-0.1451 -0.2840 -0.0182 -0.2670 -0.1987
corrcoef(dataInPrincipalComponentSpace)
ans =
1.0000 -0.0000 0.0000 -0.0000 -0.0000
-0.0000 1.0000 0.0000 -0.0000 0.0000
0.0000 0.0000 1.0000 0.0000 0.0000
-0.0000 -0.0000 0.0000 1.0000 -0.0000
-0.0000 0.0000 0.0000 -0.0000 1.0000
var(dataInPrincipalComponentSpace)'
ans =
0.2116
0.1250
0.1009
0.0357
0.0286
latent
latent =
0.2116
0.1250
0.1009
0.0357
0.0286
sort(diag(D),'descend')   % eig returns eigenvalues in ascending order, so sort to compare with latent
ans =
0.2116
0.1250
0.1009
0.0357
0.0286
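For anyone without MATLAB handy, here is the same equivalence sketched in Python/NumPy (toy data assumed, since X is not shown above). Like MATLAB's eig, np.linalg.eigh returns eigenvalues in ascending order, and eigenvector signs are arbitrary -- which is also why V's columns in the transcript appear in reverse order, with some flipped signs, relative to coeff:

```python
import numpy as np

# Toy data: 7 observations of 5 variables, centered (pca() centers internally)
rng = np.random.default_rng(0)
X = rng.standard_normal((7, 5))
X = X - X.mean(axis=0)

C = np.cov(X, rowvar=False)          # covariance matrix, like cov(X)
eigvals, V = np.linalg.eigh(C)       # ascending order, like MATLAB's eig

order = np.argsort(eigvals)[::-1]    # re-sort into descending order
latent = eigvals[order]              # matches pca()'s latent
coeff = V[:, order]                  # matches pca()'s coeff, up to sign

score = X @ coeff                    # data in principal-component space

# Each score column's variance equals the corresponding eigenvalue
print(np.allclose(score.var(axis=0, ddof=1), latent))   # True
```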
The first figure on the Wikipedia page for PCA is really helpful in understanding what is going on. There is variation along the original (x,y) axes. The superimposed arrows show the principal axes: the long arrow is the axis with the most variation, and the short arrow captures the rest. Before thinking about dimension reduction, the first step is to redefine a coordinate system (x',y') such that x' lies along the first principal component and y' along the second (and so on, if there are more variables).
In my code above, those new variables are dataInPrincipalComponentSpace. As in the original data, each row is an observation, and each column is a dimension.
These data are just like your original data, except measured in a different coordinate system: the principal axes.
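To see that this is only a change of coordinates and nothing has been lost yet, note that coeff is an orthogonal matrix, so the rotation can be undone exactly. A minimal NumPy sketch (toy data assumed, as the real X is not shown):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((7, 5))
X = X - X.mean(axis=0)               # centered, as pca() does internally

_, coeff = np.linalg.eigh(np.cov(X, rowvar=False))
score = X @ coeff                    # rotate into the principal axes

# coeff is orthogonal (coeff' * coeff = I), so rotating back is exact:
X_recovered = score @ coeff.T
print(np.allclose(X_recovered, X))   # True
```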
Now you can think about dimension reduction. Take a look at the variable explained. It tells you how much of the variation is captured by each column of dataInPrincipalComponentSpace. Here you have to make a judgment call: how much of the total variation are you willing to ignore? One guideline is that if you plot explained, there will often be an "elbow" in the plot, where each additional component explains very little additional variation. Keep the components before the elbow, and discard the rest.
In my code, notice that the first 3 components together explain 87% of the variation; suppose you decide that that's good enough. Then, for your later analysis, you would only keep those 3 dimensions -- the first three columns of dataInPrincipalComponentSpace. You will have 7 observations in 3 dimensions (variables) instead of 5.
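The "keep components up to a chosen fraction of variance" step can be sketched in NumPy as follows; the 85% threshold and the toy data here are illustrative assumptions, not a rule:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((7, 5))
X = X - X.mean(axis=0)

eigvals, V = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
latent = eigvals[order]

explained = 100 * latent / latent.sum()   # like pca()'s 'explained' output
cum = np.cumsum(explained)

# Smallest number of components whose cumulative explained variance
# reaches the chosen threshold (85% here, purely for illustration)
k = int(np.searchsorted(cum, 85.0)) + 1
reduced = (X @ V[:, order])[:, :k]        # keep only the first k score columns
print(reduced.shape)
```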
I hope that helps!