[Math] Why eigenvectors with the highest eigenvalues maximize the variance in PCA

Tags: eigenvalues-eigenvectors, linear-algebra, linear-transformations, machine-learning, principal-component-analysis

I'm learning Principal Component Analysis (PCA) and have learned that the eigenvectors of the covariance matrix of the data are the principal components, which maximize the variance of the projected data. I understand the intuition behind why we want the variance of the projected data to be as large as possible.

From this answer, I don't understand the following line:

The unit vector $u$ which maximizes the variance $u^T\Sigma u$ is nothing but the eigenvector with the largest eigenvalue.

I know from this answer that the variance of the projected data points is $u^T\Sigma u$. But I don't understand why this is maximized when $u$ is chosen to be the eigenvector of $\Sigma$ with the largest eigenvalue.
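To convince myself of that first fact, I put together a quick numerical check (my own sketch; the toy data, seed, and scaling are arbitrary) that the variance of the projected points really is $u^T\Sigma u$:

```python
# Sketch: variance of data projected onto a unit vector u equals u^T Sigma u.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3)) @ np.diag([3.0, 1.0, 0.5])  # toy data, 3 features
X = X - X.mean(axis=0)                                         # center the data
Sigma = np.cov(X, rowvar=False)                                # sample covariance

u = rng.standard_normal(3)
u = u / np.linalg.norm(u)                                      # random unit direction

proj = X @ u                                                   # projected data points
print(np.var(proj, ddof=1))   # variance of the projection
print(u @ Sigma @ u)          # u^T Sigma u -- the same number, up to rounding
```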

Intuitively, I see eigenvectors as the vectors that keep their direction under the given linear transformation (they may be rescaled, and the scaling factors are the eigenvalues). Source: this answer and this video. A tiny sketch of this property is below.
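Here is that intuition in code (the $2\times 2$ matrix is just a toy covariance matrix I picked): applying $\Sigma$ to one of its eigenvectors only rescales it.

```python
# Sketch: an eigenvector of Sigma keeps its direction under x -> Sigma x;
# it is only rescaled by its eigenvalue.
import numpy as np

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])                 # a toy covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigh: for symmetric matrices

p = eigvecs[:, -1]                             # eigenvector with the largest eigenvalue
print(Sigma @ p)                               # same direction as p ...
print(eigvals[-1] * p)                         # ... scaled by lambda = 3
```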

I can't see why vectors whose direction stays fixed under the given linear transformation should give the highest variance. Any intuitive explanation would help! Thanks.

Best Answer

Let the spectral decomposition of $\Sigma$ be $\Sigma=PDP^T,$ where $P$ is orthogonal (its columns are orthonormal) and $D$ is diagonal with the eigenvalues $\lambda_1,\dots,\lambda_d$ on its diagonal. Then $u^T\Sigma u=u^TPDP^Tu=(P^Tu)^TD(P^Tu)=\displaystyle\sum_{i=1}^d\lambda_i(p_i^Tu)^2,$ where $p_i$ is the $i^{\text{th}}$ column of $P$, in other words, the $i^\text{th}$ eigenvector of $\Sigma.$
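A quick numerical sanity check of this identity (my own sketch; the matrix is just an arbitrary symmetric positive semidefinite $\Sigma$):

```python
# Sketch: verify u^T Sigma u = sum_i lambda_i (p_i^T u)^2 numerically.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T                                # symmetric PSD "covariance"

lam, P = np.linalg.eigh(Sigma)                 # Sigma = P diag(lam) P^T
u = rng.standard_normal(4)
u = u / np.linalg.norm(u)                      # unit vector

lhs = u @ Sigma @ u
rhs = np.sum(lam * (P.T @ u) ** 2)             # sum_i lambda_i (p_i^T u)^2
print(lhs, rhs)                                # equal up to float rounding
```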

We want to find a unit vector $u$ such that $\displaystyle\sum_{i=1}^d\lambda_i(p_i^Tu)^2$ is maximized. Since the $p_i$'s form an orthonormal basis and $\|u\|=1,$ we have $\displaystyle\sum_{i=1}^d(p_i^Tu)^2=\|P^Tu\|^2=\|u\|^2=1.$ Consider the optimization problem: $$\text{Maximize }\displaystyle\sum_{i=1}^d\lambda_iz_i^2\text{ subject to }\sum_{i=1}^dz_i^2=1.$$ Suppose $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_d.$ The objective is a weighted average of the $\lambda_i$'s with weights $z_i^2,$ so it is maximized by putting all the weight on the largest eigenvalue: $z_1^2=1$ and $z_i=0$ for the rest (any other choice gives a value no larger, and strictly smaller when $\lambda_1>\lambda_2$). That would mean $$|p_1^Tu|=1,\text{ and }p_i^Tu=0\text{ for all }i\neq 1.$$ By the equality case of the Cauchy-Schwarz inequality, $|p_1^Tu|=\|p_1\|\|u\|=1\iff u=c\times p_1,$ for some constant $c.$ By the norm $1$ constraint, $u=\pm p_1;$ either sign gives the same direction, so the variance-maximizing direction is the eigenvector $p_1$ with the largest eigenvalue.
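As a sanity check of the conclusion (a sketch on arbitrary random data, not part of the proof), one can sample many random unit vectors and confirm that none of them exceeds $\lambda_1,$ the value attained at $p_1$:

```python
# Sketch: among many random unit vectors, none beats the top eigenvector p_1,
# and p_1 attains exactly lambda_1.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T                                # symmetric PSD "covariance"

lam, P = np.linalg.eigh(Sigma)                 # ascending eigenvalues
p1, lam1 = P[:, -1], lam[-1]                   # top eigenvector / eigenvalue

U = rng.standard_normal((10000, 5))
U = U / np.linalg.norm(U, axis=1, keepdims=True)   # random unit vectors
vals = np.einsum('ij,jk,ik->i', U, Sigma, U)       # u^T Sigma u for each u

print(vals.max() <= lam1 + 1e-9)  # True: no direction exceeds lambda_1
print(p1 @ Sigma @ p1, lam1)      # equal: the maximum is attained at p_1
```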
