Solved – an intuitive explanation for how PCA turns from a geometric problem (with distances) to a linear algebra problem (with eigenvectors)

Tags: intuition, linear algebra, optimization, pca

I've read a lot about PCA, including various tutorials and questions (such as this one, this one, this one, and this one).

The geometric problem that PCA is trying to optimize is clear to me: PCA tries to find the first principal component by minimizing the reconstruction (projection) error, which simultaneously maximizes the variance of the projected data.

[Figure: data points projected onto a candidate direction, illustrating the reconstruction (projection) errors and the variance of the projected data]

When I first read that, I immediately thought of something like linear regression; maybe you can solve it using gradient descent if needed.

However, then my mind was blown when I read that the optimization problem is solved by using linear algebra and finding eigenvectors and eigenvalues. I simply do not understand how this use of linear algebra comes into play.

So my question is: How can PCA turn from a geometric optimization problem to a linear algebra problem? Can someone provide an intuitive explanation?

I am not looking for an answer like this one that says "When you solve the mathematical problem of PCA, it ends up being equivalent to finding the eigenvalues and eigenvectors of the covariance matrix." Please explain why the eigenvectors come out to be the principal components and why the eigenvalues come out to be the variance of the data projected onto them.

I am a software engineer and not a mathematician, by the way.

Note: the figure above was taken and modified from this PCA tutorial.

Best Answer

Problem statement

The geometric problem that PCA is trying to optimize is clear to me: PCA tries to find the first principal component by minimizing the reconstruction (projection) error, which simultaneously maximizes the variance of the projected data.

That's right. I explain the connection between these two formulations in my answer here (without math) or here (with math).

Let's take the second formulation: PCA is trying to find the direction such that the projection of the data onto it has the highest possible variance. This direction is, by definition, called the first principal direction. We can formalize it as follows: given the covariance matrix $\mathbf C$, we are looking for a vector $\mathbf w$ having unit length, $\|\mathbf w\|=1$, such that $\mathbf w^\top \mathbf{Cw}$ is maximal.

(Just in case this is not clear: if $\mathbf X$ is the centered data matrix, then the projection is given by $\mathbf{Xw}$ and its variance is $\frac{1}{n-1}(\mathbf{Xw})^\top \cdot \mathbf{Xw} = \mathbf w^\top\cdot (\frac{1}{n-1}\mathbf X^\top\mathbf X)\cdot \mathbf w = \mathbf w^\top \mathbf{Cw}$.)
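(As a quick sanity check, here is a small numpy sketch of my own, with toy data and variable names I made up, verifying that the variance of the projection $\mathbf{Xw}$ equals $\mathbf w^\top \mathbf{Cw}$ when $\mathbf X$ is centered.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # toy data: 500 samples, 3 features
X = X - X.mean(axis=0)                   # center the columns
C = X.T @ X / (X.shape[0] - 1)           # sample covariance matrix

w = rng.normal(size=3)
w = w / np.linalg.norm(w)                # an arbitrary unit-length direction

var_of_projection = np.var(X @ w, ddof=1)   # variance of the projected data
quadratic_form = w @ C @ w                  # w^T C w
print(np.isclose(var_of_projection, quadratic_form))  # True
```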

On the other hand, an eigenvector of $\mathbf C$ is, by definition, any vector $\mathbf v$ such that $\mathbf{Cv}=\lambda \mathbf v$.

It turns out that the first principal direction is given by the eigenvector with the largest eigenvalue. This is a nontrivial and surprising statement.
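If you want to see the claim in action before reading any proofs, here is a small numerical sketch of mine (numpy, toy data, nothing canonical about the numbers): the eigenvector of $\mathbf C$ with the largest eigenvalue beats thousands of random unit directions at maximizing $\mathbf w^\top\mathbf{Cw}$, and the maximum it attains equals that eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
X = X - X.mean(axis=0)
C = X.T @ X / (X.shape[0] - 1)

eigenvalues, eigenvectors = np.linalg.eigh(C)  # eigh is for symmetric matrices;
w_top = eigenvectors[:, -1]                    # eigenvalues come in ascending order

best_random = 0.0
for v in rng.normal(size=(10_000, 5)):         # try many random unit directions
    u = v / np.linalg.norm(v)
    best_random = max(best_random, u @ C @ u)

print(w_top @ C @ w_top >= best_random)                # True: no random direction wins
print(np.isclose(w_top @ C @ w_top, eigenvalues[-1]))  # attained variance = largest eigenvalue
```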


Proofs

If one opens any book or tutorial on PCA, one can find there the following almost one-line proof of the statement above. We want to maximize $\mathbf w^\top \mathbf{Cw}$ under the constraint that $\|\mathbf w\|=\mathbf w^\top \mathbf w=1$; this can be done by introducing a Lagrange multiplier and maximizing $\mathbf w^\top \mathbf{Cw}-\lambda(\mathbf w^\top \mathbf w-1)$. Differentiating and setting the derivative to zero, we obtain $\mathbf{Cw}-\lambda\mathbf w=0$, which is the eigenvector equation: every stationary point is an eigenvector of $\mathbf C$. Substituting such a solution back into the objective function gives $\mathbf w^\top \mathbf{Cw}-\lambda(\mathbf w^\top \mathbf w-1) = \mathbf w^\top \mathbf{Cw} = \lambda\mathbf w^\top \mathbf{w} = \lambda$. Since this objective function should be maximized, $\lambda$ must be the largest eigenvalue, QED.
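Since you mentioned gradient descent: the stationarity condition $\mathbf{Cw}=\lambda\mathbf w$ can also be reached iteratively. The sketch below is my own illustration, not part of the standard proof; it uses power iteration, i.e. repeatedly applying $\mathbf C$ and renormalizing, which converges to the eigenvector with the largest eigenvalue, the maximizer of $\mathbf w^\top\mathbf{Cw}$ on the unit sphere.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
C = A @ A.T                          # a symmetric positive semi-definite "covariance"

w = rng.normal(size=4)
w = w / np.linalg.norm(w)            # random unit-length starting point
for _ in range(1000):
    w = C @ w                        # apply C ...
    w = w / np.linalg.norm(w)        # ... and re-impose the constraint ||w|| = 1

eigenvalues, eigenvectors = np.linalg.eigh(C)
w_top = eigenvectors[:, -1]
print(np.isclose(abs(w @ w_top), 1.0))         # converged to the top eigenvector (up to sign)
print(np.isclose(w @ C @ w, eigenvalues[-1]))  # attained variance equals the largest eigenvalue
```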

This Lagrange-multiplier proof tends not to be very intuitive for most people.

A better proof (see e.g. this neat answer by @cardinal) says that because $\mathbf C$ is a symmetric matrix, it is diagonal in its eigenvector basis. (This is actually called the spectral theorem.) So we can choose an orthonormal basis, namely the one given by the eigenvectors (sorted so that the eigenvalues decrease), in which $\mathbf C$ is diagonal with the eigenvalues $\lambda_i$ on the diagonal. In that basis, $\mathbf w^\top \mathbf{C w}$ simplifies to $\sum_i \lambda_i w_i^2$; in other words, the variance is a weighted average of the eigenvalues with non-negative weights $w_i^2$ that sum to one. It is almost immediate that to maximize this expression one should simply take $\mathbf w = (1,0,0,\ldots, 0)$, i.e. the first eigenvector, yielding variance $\lambda_1$ (indeed, deviating from this solution and "trading" part of the weight on the largest eigenvalue for weight on smaller ones can only decrease the overall variance). Note that the value of $\mathbf w^\top \mathbf{C w}$ does not depend on the basis! Changing to the eigenvector basis amounts to a rotation, so in 2D one can imagine simply rotating a piece of paper with the scatterplot; obviously this cannot change any variances.
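Here is a small numerical companion to this argument (again a sketch of my own, relying on numpy's `eigh`): expressing $\mathbf w$ in the eigenvector basis turns $\mathbf w^\top\mathbf{Cw}$ into $\sum_i \lambda_i w_i^2$, and putting all the weight on the top eigenvector attains the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
C = A @ A.T                                 # a symmetric matrix

eigenvalues, V = np.linalg.eigh(C)          # columns of V: orthonormal eigenvectors

w = rng.normal(size=4)
w = w / np.linalg.norm(w)                   # arbitrary unit-length direction
w_eig = V.T @ w                             # coordinates of w in the eigenvector basis

print(np.isclose(w @ C @ w, np.sum(eigenvalues * w_eig**2)))  # w^T C w = sum_i lambda_i w_i^2

e_top = V[:, -1]                            # all the weight on the top eigenvector
print(np.isclose(e_top @ C @ e_top, eigenvalues[-1]))         # attains the largest eigenvalue
```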

I think this eigenbasis argument is very intuitive and very useful, but it relies on the spectral theorem. So the real question here, I think, is: what is the intuition behind the spectral theorem?


Spectral theorem

Take a symmetric matrix $\mathbf C$. Take its eigenvector $\mathbf w_1$ with the largest eigenvalue $\lambda_1$. Make this eigenvector the first basis vector and choose other basis vectors randomly (such that all of them are orthonormal). How will $\mathbf C$ look in this basis?

It will have $\lambda_1$ in the top-left corner, because $\mathbf w_1=(1,0,0\ldots 0)$ in this basis and $\mathbf {Cw}_1=(C_{11}, C_{21}, \ldots C_{p1})$ has to be equal to $\lambda_1\mathbf w_1 = (\lambda_1,0,0 \ldots 0)$.

By the same argument, it will have zeros in the first column below $\lambda_1$.

But because it is symmetric, it will have zeros in the first row after $\lambda_1$ as well. So it will look like this:

$$\mathbf C=\begin{pmatrix}\lambda_1 & 0 & \ldots & 0 \\ 0 & & & \\ \vdots & & & \\ 0 & & & \end{pmatrix},$$

where empty space means that there is a block of some elements there. Because the matrix is symmetric, this block will be symmetric too. So we can apply exactly the same argument to it, effectively using the second eigenvector as the second basis vector, and getting $\lambda_1$ and $\lambda_2$ on the diagonal. This can continue until $\mathbf C$ is diagonal. That is essentially the spectral theorem. (Note how it works only because $\mathbf C$ is symmetric.)
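If you prefer to see this construction numerically, here is a sketch of the first step (my own code, assuming numpy and a random symmetric test matrix): put $\mathbf w_1$ first, complete it to an orthonormal basis via a QR decomposition of otherwise random vectors, and observe the block structure of $\mathbf C$ in that basis.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))
C = A @ A.T                                  # a symmetric matrix

eigenvalues, eigenvectors = np.linalg.eigh(C)
w1, lam1 = eigenvectors[:, -1], eigenvalues[-1]

# Complete w1 to an orthonormal basis: QR of [w1 | random vectors].
M = np.column_stack([w1, rng.normal(size=(4, 3))])
Q, _ = np.linalg.qr(M)                       # orthonormal columns; Q[:, 0] is +/- w1

C_new = Q.T @ C @ Q                          # C expressed in the new basis
print(np.isclose(C_new[0, 0], lam1))         # lambda_1 in the top-left corner
print(np.allclose(C_new[0, 1:], 0))          # zeros in the rest of the first row ...
print(np.allclose(C_new[1:, 0], 0))          # ... and of the first column
```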


Here is a more abstract reformulation of exactly the same argument.

We know that $\mathbf{Cw}_1 = \lambda_1 \mathbf w_1$, so the first eigenvector defines a 1-dimensional subspace where $\mathbf C$ acts as a scalar multiplication. Let us now take any vector $\mathbf v$ orthogonal to $\mathbf w_1$. Then it is almost immediate that $\mathbf {Cv}$ is also orthogonal to $\mathbf w_1$. Indeed:

$$ \mathbf w_1^\top \mathbf{Cv} = (\mathbf w_1^\top \mathbf{Cv})^\top = \mathbf v^\top \mathbf C^\top \mathbf w_1 = \mathbf v^\top \mathbf {Cw}_1=\lambda_1 \mathbf v^\top \mathbf w_1 = \lambda_1\cdot 0 = 0.$$

This means that $\mathbf C$ maps the whole remaining subspace orthogonal to $\mathbf w_1$ into itself, never mixing it back with $\mathbf w_1$. This is the crucial property of symmetric matrices. So within that subspace we can again find the eigenvector with the largest eigenvalue, $\mathbf w_2$, and proceed in the same manner, eventually constructing an orthonormal basis of eigenvectors.
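And a short numerical restatement of this invariance (my own sketch, random symmetric test matrix): any vector orthogonal to $\mathbf w_1$ is mapped by $\mathbf C$ to a vector that is still orthogonal to $\mathbf w_1$.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(5, 5))
C = A @ A.T                        # a symmetric matrix

_, eigenvectors = np.linalg.eigh(C)
w1 = eigenvectors[:, -1]           # eigenvector with the largest eigenvalue

v = rng.normal(size=5)
v = v - (v @ w1) * w1              # remove the w1 component, so v is orthogonal to w1
print(np.isclose(v @ w1, 0))       # True
print(np.isclose((C @ v) @ w1, 0)) # True: C v stays orthogonal to w1
```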
