Am I Misunderstanding Principal Component Analysis (PCA)?

dimensional analysis, linear algebra, principal component analysis, statistics

Principal Component Analysis (PCA) has routinely caused me to question my understanding of mathematics, particularly linear algebra. Once again PCA has come up, and I would like to engage the community to check my understanding against my previous formal education and the references I have recently used to refresh it. To put it more succinctly: is my interpretation of PCA (below) accurate?

PCA, as a dimension-reduction technique, is employed to reduce a large data-set (i.e. one with many variables) into a smaller, more coherent data-set, while maintaining most of the 'principal' information.

This is where I am unsure of my understanding: the outputs from PCA – that is, the principal components – are then utilized for further analysis. But to what end?

Let us take a routine example, the Iris data-set. Many programming environments (R, Python, SPSS) use this data-set as a practical application of PCA. Note the output from Python's scikit-learn module:
[Figure: PCA of the Iris dataset, produced with scikit-learn]

Understanding that PCA "identifies the combination of attributes ... that account for the most variance in the data", what can we do with this output? My interpretation – which I believe is flawed – is that the PCA output (from scikit-learn) shows a strong correlation between 'virginica' and 'versicolor'. But is that all? Is this merely a pre-processing technique whose output is then fed to machine learning models? It does not seem that the outputs from PCA (e.g. PC1, PC2, PC3) could be used for feature reduction. When plotting principal components, what information are we getting, since we are not really showing what the principal components hold? If one were to present the output in 2-D (as with the 'PCA of IRIS dataset' plot below), would the intent be to show that there is a stronger correlation between 'virginica' and 'versicolor'?
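For reference, the sort of plot in question can be reproduced with a few lines of scikit-learn and matplotlib (a minimal sketch; the variable names are my own):

```python
# A minimal sketch: project the Iris data onto its first two principal
# components and plot the result, coloured by species.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target          # 150 samples, 4 features

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)            # coordinates in the PC1/PC2 basis

print(pca.explained_variance_ratio_)   # fraction of variance per component

for label, name in enumerate(iris.target_names):
    plt.scatter(X_2d[y == label, 0], X_2d[y == label, 1], label=name)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.title("PCA of IRIS dataset")
plt.show()
```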

I have interpreted PCA as a pre-processing technique that allows one to identify the 'most important' (weighted, influential, etc.) features. Thus, the output I would expect from PCA would be something more akin to:

  • PC1: Petal Width
  • PC2: Petal Length

Best Answer

Take a point cloud and/or a set of vectors (which are the same thing). Move the cloud to have $0$ mean. For every possible direction, determine the variance of the dataset along that direction. Declare the direction of maximal variance to be the first principal component.

Now, project out that component. The residual data is still mean zero, so compute the new direction of maximal variance and that's the second principal component.

Repeat until all variance has been projected out (or the residual variance is some tiny fraction of the initial variance).
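A rough sketch of that procedure in Python, using the top eigenvector of the covariance matrix as the direction of maximal variance at each step (`pca_by_deflation` is a name I made up for illustration; this is not production code):

```python
# Sketch of PCA by deflation: center the cloud, repeatedly take the
# direction of maximal variance, then project that component out.
import numpy as np

def pca_by_deflation(X, n_components):
    X = X - X.mean(axis=0)                      # move the cloud to mean 0
    components = []
    for _ in range(n_components):
        cov = np.cov(X, rowvar=False)           # covariance of the residual
        eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrix
        direction = eigvecs[:, -1]              # direction of max variance
        components.append(direction)
        # project out that component: remove each point's coordinate along it
        X = X - np.outer(X @ direction, direction)
    return np.array(components)
```

Up to sign, the rows returned by this sketch should agree with `sklearn.decomposition.PCA(n_components=...).fit(X).components_`.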

The example shows a common pattern: the dispersion in the direction of the first PC is from $-4$ to $4$. The dispersion in the second PC is from $-1.5$ to $1.5$, already a factor of $3$ reduction. If this pattern continues, the dispersion will fall by about a factor of $10$ for every two additional PCs. Many real datasets only have a small number of PCs before the residuals are very clearly noise.
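If you want to check that decay numerically rather than reading it off the plot, a quick sketch with scikit-learn on the Iris data:

```python
# Dispersion per component: the standard deviation of the data along
# each PC, and the fraction of total variance each one carries.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

pca = PCA().fit(load_iris().data)
print(np.sqrt(pca.explained_variance_))   # std. dev. along PC1, PC2, ...
print(pca.explained_variance_ratio_)      # fraction of total variance
```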

So, what is the first principal component? It's the direction of maximal variance. What does that have to do with input features like petal width and petal length? Very little, although if these features lead to large dispersion, you should expect the PCs to incorporate them. If relevant features appear with low correlation, they will largely land in different components. If they appear with high correlation, they will largely land in one component. Thus, PCs make some progress in working with features that are not independent. (Since many real-world datasets have lots of implicit correlations, only a few PCs are needed to extract most of the dispersion.)

This is linear algebra, so PCs are weighted sums of features. The first PC is a weighted sum of features that points along the direction of maximal variance. You will frequently read "that explains the most variance", but that use of "explains" asserts more causation than is actually present. The second PC is a weighted sum of features pointing along the direction of maximal variance after removing the variance "explained" by the first PC. And so on.
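You can see those weights directly: scikit-learn exposes them as `components_`, one row per PC. A minimal sketch on the Iris data, printing each PC as a weighted sum of the original features:

```python
# Each row of pca.components_ is a unit-length weight vector over the
# original four Iris features, not a single "most important" feature.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2).fit(iris.data)

for i, weights in enumerate(pca.components_, start=1):
    terms = " ".join(f"{w:+.2f}*{name}"
                     for w, name in zip(weights, iris.feature_names))
    print(f"PC{i} = {terms}")
```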

If you are lucky, the first few PCs are nearly parallel to feature axes, so each can be easily described as "capturing this feature" or "capturing that feature". But many datasets have implicit, unexpected, or unrecognized correlations that lead to components mixing features. For instance, I would expect petal dimensions to be correlated, so I expect those features to appear mixed into one component.
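Here is a small synthetic sketch of that effect (the feature names and coefficients are made up for illustration): two strongly correlated "petal-like" features end up mixed into one component, while an independent feature gets a component of its own.

```python
# Synthetic example: highly correlated features land in one PC.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
petal_length = rng.normal(size=500)
petal_width = 0.9 * petal_length + 0.1 * rng.normal(size=500)  # correlated
sepal_noise = rng.normal(size=500)                              # independent

X = np.column_stack([petal_length, petal_width, sepal_noise])
pca = PCA().fit(X)
print(pca.components_)               # first row mixes the two petal features
print(pca.explained_variance_ratio_) # third component is nearly pure noise
```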

The dataset you graph suggests that the first principal component is sufficient to discriminate setosa from the other categories, so setosa can be picked out by a very simple linear classifier. The other two categories are not strongly separated by the first two components. However, a new sample with a very extreme second coordinate in the PC basis might be assignable to the versicolor (near $(-1, 1.5)$) or virginica (near $(4, -1.5)$) category.
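That observation translates directly into the downstream use the question asks about: the PC coordinates can be fed to a model. A minimal sketch, assuming scikit-learn, that uses only the first component to separate setosa from the rest (the cross-validated accuracy should come out at or near $1.0$ on this data):

```python
# A linear classifier on the first principal component alone is enough
# to separate setosa from the other two species.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X_pc1 = PCA(n_components=1).fit_transform(iris.data)   # keep only PC1
is_setosa = (iris.target == 0)                          # binary target

clf = LogisticRegression()
print(cross_val_score(clf, X_pc1, is_setosa, cv=5).mean())
```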
