Solved – Interpreting overlapping arrows on a PCA biplot: does it mean that the variables are redundant?


I'm new to principal component analysis (PCA) and I don't really understand the biplot representation of its results, so I would really appreciate some guidance. Taking the illustration below as an example, could I say that variables x1 and x2 are telling me the same thing, so that there is no need to record the values of one of the two?

In my case the variables are geometrical deviations of a part, and the measured points are close to each other. Therefore, I would like to know whether, based on the PCA biplot, I could stop measuring x2 if I have already measured x1.

[Image: PCA biplot on PC1 and PC2 in which the arrows for x1 and x2 overlap]

Best Answer

X1 and X2 are "redundant", in the sense of being linear duplicates of each other, if they correlate perfectly ($r=1$). In that case the two variable vectors must coincide, i.e. be collinear, in the space where variables are drawn as vectors (arrows). That space is called "subject space".
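To make the vector picture concrete, here is a minimal NumPy sketch. The data are simulated and the helper name `cosine_between` is made up for illustration. It checks that the cosine of the angle between two centered variable vectors equals their Pearson correlation, and that an exact linear duplicate gives a cosine of 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated (hypothetical) measurements: x2 is related to x1 but not a duplicate.
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)


def cosine_between(a, b):
    """Cosine of the angle between two variables seen as centered n-vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


cos_angle = cosine_between(x1, x2)
pearson_r = float(np.corrcoef(x1, x2)[0, 1])

# An exact linear duplicate (positive slope) is collinear with x1: cosine = 1.
x2_dup = 3.0 * x1 + 5.0
cos_dup = cosine_between(x1, x2_dup)

print(cos_angle, pearson_r, cos_dup)
```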

But from the plot alone, without knowing the correlations between the variables, you can't tell whether the vectors coincide in that space, because the plot is only 2-dimensional whereas the space spanned by the four variables is potentially 4-dimensional (or 3-dimensional, if X1 and X2 do coincide). The plane of the plot, defined by the first two PCs, is a subspace of that 4- (or 3-) dimensional space. What we see as arrows on the plane are just the projections of the true variable vectors onto it, the shadows they cast on it. What I'm saying is expressed more graphically and with formulas here. Thus, with only your plot to go on, the question of whether X1 and X2 coincide remains open.
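The projection argument can be checked numerically. The sketch below uses a made-up correlation matrix for four variables (not the asker's data) in which X1 and X2 correlate only at 0.5, yet their arrows on the PC1-PC2 plane coincide because the two variables differ only along PC3. It assumes the arrows are drawn as loadings, i.e. eigenvectors scaled by the square roots of the eigenvalues.

```python
import numpy as np

# Hypothetical correlation matrix for X1..X4, invented for illustration.
# X1 and X2 correlate at only 0.5 and relate identically to X3 and X4.
R = np.array([
    [ 1.0,  0.5,  0.3, -0.3],
    [ 0.5,  1.0,  0.3, -0.3],
    [ 0.3,  0.3,  1.0, -0.6],
    [-0.3, -0.3, -0.6,  1.0],
])

# PCA of standardized variables = eigendecomposition of the correlation matrix.
eigvals, eigvecs = np.linalg.eigh(R)        # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # reorder so PC1 comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: rows are variables, columns are PCs.
loadings = eigvecs * np.sqrt(eigvals)

# What the biplot shows: only the first two coordinates of each variable vector.
arrows_2d = loadings[:, :2]
gap_2d = float(np.linalg.norm(arrows_2d[0] - arrows_2d[1]))    # ~0: arrows overlap
gap_full = float(np.linalg.norm(loadings[0] - loadings[1]))    # 1: vectors differ

explained_2d = float(eigvals[:2].sum() / eigvals.sum())        # 0.775

print(gap_2d, gap_full, explained_2d)
```

Here the two arrows are indistinguishable on the plot even though the two variables are far from being duplicates, so the overlap by itself settles nothing.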

But suppose the first two PCs explain the lion's share of the variability (say, 80% or more of the overall variance). That means the subsequent dimensions (defined by PC3 and PC4) are shallow, so the space is not far from being just the plane you showed. Then the angle between X1 and X2 (whose cosine is their correlation) cannot be wide, which is to say that the two variables are not far from being collinear. If so (i.e. little variance is left unexplained by PC1+PC2), you may regard X1 and X2 as reasonably redundant.
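One way to quantify "cannot be wide" is sketched below. It rests on assumptions that are not in the original answer: the variables are standardized and the arrows are loadings, so each full variable vector has length 1 and the squared length of an arrow is the share of that variable's variance captured by the plane. The correlation is the in-plane dot product plus a remainder from the later PCs, and by the Cauchy-Schwarz inequality that remainder is at most $\sqrt{(1-\|a_1\|^2)(1-\|a_2\|^2)}$ in absolute value. The function name `correlation_bounds` is hypothetical.

```python
import numpy as np


def correlation_bounds(a1, a2):
    """Bounds on corr(X1, X2) given only their 2-D loading arrows.

    Assumes standardized variables and arrows drawn as loadings, so the
    full-space vector of each variable has unit length.
    """
    a1 = np.asarray(a1, dtype=float)
    a2 = np.asarray(a2, dtype=float)
    in_plane = float(a1 @ a2)                  # part of r visible on the plot
    left1 = max(0.0, 1.0 - float(a1 @ a1))     # variance of X1 outside the plane
    left2 = max(0.0, 1.0 - float(a2 @ a2))     # variance of X2 outside the plane
    slack = float(np.sqrt(left1 * left2))      # Cauchy-Schwarz bound on the rest
    return max(-1.0, in_plane - slack), min(1.0, in_plane + slack)


# Overlapping long arrows (length 0.95): the correlation must be high.
long_lo, long_hi = correlation_bounds([0.95, 0.0], [0.95, 0.0])

# Overlapping short arrows (length 0.70): the correlation could even be about zero.
short_lo, short_hi = correlation_bounds([0.70, 0.0], [0.70, 0.0])

print(long_lo, long_hi)    # about 0.805 and 1.0
print(short_lo, short_hi)  # about -0.02 and 1.0
```

Under these assumptions, what matters for a given pair is how much of those two variables' own variance the plane captures. A high overall explained variance makes that likely for most variables but does not guarantee it for every one.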

Finally, you ask: if they do coincide or nearly coincide, and are therefore redundant for you, could you stop measuring x2 once you have measured x1? That depends on what you are going to do after deleting one of the two variables. If, for example, you delete one and redo the PCA, the PCs will change even though you deleted a "redundant" measurement.
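The last point can be illustrated with a small sketch, again on a made-up correlation matrix. Here X2 is an exact duplicate of X1 ($r=1$), and PCA is run with and without X2. The helper name `pca_loadings` is hypothetical.

```python
import numpy as np


def pca_loadings(R):
    """Eigenvalues (descending) and loadings of a correlation matrix."""
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    # Clip tiny negative values that can arise numerically for a singular R.
    eigvals = np.clip(eigvals[order], 0.0, None)
    return eigvals, eigvecs[:, order] * np.sqrt(eigvals)


# Hypothetical correlations for X1..X4; X2 is a perfect duplicate of X1.
R_full = np.array([
    [ 1.0,  1.0,  0.3, -0.3],
    [ 1.0,  1.0,  0.3, -0.3],
    [ 0.3,  0.3,  1.0, -0.6],
    [-0.3, -0.3, -0.6,  1.0],
])

# Drop X2 (index 1) and redo the PCA on X1, X3, X4.
R_reduced = np.delete(np.delete(R_full, 1, axis=0), 1, axis=1)

vals_full, load_full = pca_loadings(R_full)
vals_reduced, load_reduced = pca_loadings(R_reduced)

# Loading of X3 on PC1 before and after (absolute value: PC signs are arbitrary).
x3_on_pc1_full = abs(float(load_full[2, 0]))
x3_on_pc1_reduced = abs(float(load_reduced[1, 0]))

print(vals_full[0], vals_reduced[0])          # about 2.43 vs 1.82
print(x3_on_pc1_full, x3_on_pc1_reduced)      # about 0.64 vs 0.85
```

With the duplicate present, the X1/X2 direction counts twice and pulls PC1 towards itself. After removing X2, PC1 is a different combination of the remaining variables, so PCs computed before and after the deletion are not interchangeable.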