If I have 50 variables in my PCA, I get a matrix of eigenvectors and eigenvalues out (I am using the MATLAB function eig).
I have normalised the eigenvalues to sum to 1, and they are returned already sorted by magnitude. I just want to know how to match them to the variables, by looking at the matrix of eigenvectors. I know the largest eigenvector coefficient corresponds to the largest eigenvalue, but is that the largest absolute value, or is it always the largest positive value, so closest to positive infinity?
From the image below, you can see in the command window the normalised eigenvalues (64.39%, 15.01% …) and in the window at the top there is a part of the corresponding eigenvector matrix. You can see there are positive and negative values. I hope that helps make my question clear.
Here is an example from the MATLAB website. Which is the principal component of each of those 4 column vectors?
Example
    load hald
    covx = cov(ingredients);
    [COEFF,latent,explained] = pcacov(covx)

    COEFF =
        0.0678   -0.6460    0.5673   -0.5062
        0.6785   -0.0200   -0.5440   -0.4933
       -0.0290    0.7553    0.4036   -0.5156
       -0.7309   -0.1085   -0.4684   -0.4844

    latent =
      517.7969
       67.4964
       12.4054
        0.2372

    explained =
       86.5974
       11.2882
        2.0747
        0.0397
Best Answer
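You appear to be assuming that the largest eigenvalue can necessarily be paired with the largest coefficient within the eigenvectors. That would be wrong. Each eigenvalue pairs with a whole eigenvector, that is, with one column of the coefficient matrix (the first column with the first eigenvalue, and so on), and each row of that column holds the weight on one original variable, in the order the variables were entered.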
The question clearly transcends software choice. Here is a fairly silly PCA on five measures of car size using Stata's auto dataset. I used a correlation matrix as starting point, the only sensible option given quite different units of measurement.
(Output truncated.)
The first component picks up on the fact that as all variables are measures of size, they are well correlated. So to first approximation the coefficients are equal; that's to be expected when all the variables hang together. The remaining components in effect pick up the idiosyncratic contribution of each of the original variables. That is not inevitable, but it works out quite simply for this example. But, to your point, you can see that the largest coefficients, say those above 0.7 in absolute value, are associated with components 2 to 5. There is nothing to stop the largest coefficient being associated with the last component.
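The same check can be run on the pcacov output quoted in the question. What follows is only a minimal sketch in plain Python: the numbers are copied from that output, while the helper name dominant_variable and the choice to rank coefficients by absolute value are assumptions made for illustration (an eigenvector multiplied by -1 is still an eigenvector, so the sign carries no ranking information).

```python
# Loadings and explained variance copied from the pcacov output in the question.
# Rows are the four original variables (in input order); columns are components 1 to 4.
COEFF = [
    [ 0.0678, -0.6460,  0.5673, -0.5062],
    [ 0.6785, -0.0200, -0.5440, -0.4933],
    [-0.0290,  0.7553,  0.4036, -0.5156],
    [-0.7309, -0.1085, -0.4684, -0.4844],
]
explained = [86.5974, 11.2882, 2.0747, 0.0397]


def dominant_variable(coeff, component):
    """Hypothetical helper: (row index, coefficient) of the largest |coefficient| in one column."""
    column = [row[component] for row in coeff]
    row_index = max(range(len(column)), key=lambda i: abs(column[i]))
    return row_index, column[row_index]


# One (variable index, coefficient) pair per component, in component order.
summary = [dominant_variable(COEFF, j) for j in range(len(explained))]

for j, (i, value) in enumerate(summary):
    print(f"component {j + 1} ({explained[j]:.2f}% of variance): "
          f"largest |coefficient| is {value:+.4f} on variable {i + 1}")

# Which component holds the single largest coefficient in the whole matrix?
top_component = max(range(len(summary)), key=lambda j: abs(summary[j][1]))
print("largest coefficient overall sits in component", top_component + 1)
```

On these numbers the largest coefficient in the whole matrix (0.7553, on the third variable) belongs to component 2, which explains about 11% of the variance, not to component 1.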
(UPDATE) The eigenvectors are informative, but it is also helpful to calculate the components themselves as new variables and then look at their correlations with the original variables. Here they are:
Here trunk is the variable most strongly correlated with pc3, but negatively. A story on why that happens would depend on looking at the data and the PCs. I don't care enough about the example to do that here, but it would be good practice.

Although I produced the example with a little prior thought, and it is suitable for your question, and it is based on real data, it is also salutary: interpreting the PCs may be no easier than interpreting something more direct such as scatter plots and correlations. However, not every PCA application depends on ability to interpret PCs as having substantive meaning, and much of the literature warns against doing that in any case. For some purposes, the whole point is a mechanistic reordering of the information in the data.
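For the pcacov output in the question, the correlations between components and original variables can be recovered without the raw data. This is a sketch under stated assumptions, not the Stata calculation used above: it assumes a covariance-based PCA, so that the covariance matrix equals COEFF * diag(latent) * COEFF', which gives var(x_i) as the sum over j of COEFF(i,j)^2 * latent(j) and corr(x_i, pc_j) = COEFF(i,j) * sqrt(latent(j)) / sd(x_i). The printed inputs are rounded to four decimals, so the results are approximate.

```python
import math

# Loadings and eigenvalues copied from the pcacov output in the question.
COEFF = [
    [ 0.0678, -0.6460,  0.5673, -0.5062],
    [ 0.6785, -0.0200, -0.5440, -0.4933],
    [-0.0290,  0.7553,  0.4036, -0.5156],
    [-0.7309, -0.1085, -0.4684, -0.4844],
]
latent = [517.7969, 67.4964, 12.4054, 0.2372]

n_vars, n_comps = len(COEFF), len(latent)

# Assumption: covariance = COEFF * diag(latent) * COEFF', so its diagonal
# gives the variance of each original variable.
variances = [
    sum(COEFF[i][j] ** 2 * latent[j] for j in range(n_comps))
    for i in range(n_vars)
]

# corr(x_i, pc_j) = COEFF[i][j] * sqrt(latent[j]) / sd(x_i)
correlations = [
    [COEFF[i][j] * math.sqrt(latent[j]) / math.sqrt(variances[i])
     for j in range(n_comps)]
    for i in range(n_vars)
]

# One row per original variable, one column per component.
for i, row in enumerate(correlations):
    print(f"variable {i + 1}: " + "  ".join(f"{r:+.3f}" for r in row))
```

Because the variables have unequal variances, a correlation can look quite different from the raw coefficient. The fourth variable, for instance, has coefficient -0.7309 on component 1 but a correlation with it of roughly -0.99.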