PCA Objective Function – Connection Between Maximizing Variance and Minimizing Error

Tags: optimization, pca

The PCA algorithm can be formulated in terms of the correlation matrix (assume the data $X$ has already been normalized and we are only considering projection onto the first PC). The objective function can be written as:

$$ \max_w \; (Xw)^T(Xw) \quad \text{s.t.} \quad w^Tw = 1. $$

This is fine, and we can solve it with Lagrange multipliers, i.e. by rewriting it as:

$$ \max_w \left[ (Xw)^T(Xw) - \lambda w^Tw \right], $$

which is equivalent to

$$ \max_w \frac{ (Xw)^T(Xw) }{w^Tw},$$

and hence (see MathWorld) seems to be equal to $$\max_w \sum_{i=1}^n \text{(distance from point $x_i$ to line $w$)}^2.$$

But this says to maximize the distance between each point and the line, whereas from what I've read, that is incorrect: it should be $\min$, not $\max$. Where is my error?

Or, can someone show me the link between maximizing variance in projected space and minimizing distance between point and line?
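To make the two quantities I am comparing concrete, here is a minimal numpy sketch of what I have in mind (toy random data; all variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))   # toy data: n = 100 points in 3 dimensions
X -= X.mean(axis=0)                 # center the data, as assumed above

w = rng.standard_normal(3)
w /= np.linalg.norm(w)              # unit vector, so w^T w = 1

proj_sq = (X @ w) @ (X @ w)         # (Xw)^T (Xw), the quantity being maximized
resid = X - np.outer(X @ w, w)      # each x_i minus its projection onto line w
dist_sq = (resid ** 2).sum()        # sum of squared point-to-line distances
print(proj_sq, dist_sq)
```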

Best Answer

Let $\newcommand{\X}{\mathbf X}\X$ be a centered data matrix with $n$ observations in rows. Let $\newcommand{\S}{\boldsymbol \Sigma}\S=\X^\top\X/(n-1)$ be its covariance matrix. Let $\newcommand{\w}{\mathbf w}\w$ be a unit vector specifying an axis in the variable space. We want $\w$ to be the first principal axis.

According to the first approach, the first principal axis maximizes the variance of the projection $\X \w$ (the variance of the first principal component). This variance is given by $$\mathrm{Var}(\X\w)=\w^\top\X^\top \X \w/(n-1)=\w^\top\S\w.$$
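As a quick numerical check of this identity, a minimal numpy sketch (toy centered data; the variable names are mine, not from the derivation above):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4))
X -= X.mean(axis=0)                      # centered data matrix
S = X.T @ X / (X.shape[0] - 1)           # covariance matrix Sigma

w = rng.standard_normal(4)
w /= np.linalg.norm(w)                   # unit vector w

# sample variance of the projection Xw equals the quadratic form w^T S w
print(np.var(X @ w, ddof=1), w @ S @ w)  # the two numbers agree
```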

According to the second approach, the first principal axis minimizes the reconstruction error between $\X$ and its reconstruction $\X\w\w^\top$, i.e. the sum of squared distances between the original points and their projections onto $\w$. The squared reconstruction error is given by \begin{align}\newcommand{\tr}{\mathrm{tr}} \|\X-\X\w\w^\top\|^2 &=\tr\left((\X-\X\w\w^\top)(\X-\X\w\w^\top)^\top\right) \\ &=\tr\left((\X-\X\w\w^\top)(\X^\top-\w\w^\top\X^\top)\right) \\ &=\tr(\X\X^\top)-2\tr(\X\w\w^\top\X^\top)+\tr(\X\w\w^\top\w\w^\top\X^\top) \\ &=\mathrm{const}-\tr(\X\w\w^\top\X^\top) \\ &=\mathrm{const}-\tr(\w^\top\X^\top\X\w) \\ &=\mathrm{const} - (n-1) \cdot \w^\top \S \w, \end{align} where the fourth line uses $\w^\top\w=1$ to collapse the last two terms and the fifth uses the cyclic property of the trace.
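The final identity can also be verified numerically; a minimal sketch, assuming centered toy data (names mine):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 4))
X -= X.mean(axis=0)                      # centered data
n = X.shape[0]
S = X.T @ X / (n - 1)                    # covariance matrix

w = rng.standard_normal(4)
w /= np.linalg.norm(w)                   # unit vector

recon_err = np.linalg.norm(X - X @ np.outer(w, w)) ** 2  # ||X - X w w^T||_F^2
const = np.linalg.norm(X) ** 2                           # tr(X X^T) = ||X||_F^2
print(recon_err, const - (n - 1) * (w @ S @ w))          # the two sides agree
```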

Notice the minus sign before the main term. Because of it, minimizing the reconstruction error amounts to maximizing $\w^\top \S \w$, which is the variance. So minimizing the reconstruction error is equivalent to maximizing the variance, and both formulations yield the same $\w$. This also locates the error in the question: $\|\X\w\|^2$ is the sum of squared *projections* onto the line, not the sum of squared distances to it; by Pythagoras the two sum to the constant $\|\X\|^2$, so maximizing the former is the same as minimizing the latter.
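To see both optimality claims hold at once, here is a short numpy sketch (toy data, brute-force comparison against random unit vectors; all names are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((300, 5))
X -= X.mean(axis=0)
n = X.shape[0]
S = X.T @ X / (n - 1)

# first principal axis: leading eigenvector of the covariance matrix
w_pc = np.linalg.eigh(S)[1][:, -1]       # eigh sorts eigenvalues ascending

def variance(w):
    return w @ S @ w                     # variance of the projection Xw

def recon_error(w):
    return np.linalg.norm(X - X @ np.outer(w, w)) ** 2

# compare against many random unit vectors: w_pc wins both contests
W = rng.standard_normal((1000, 5))
W /= np.linalg.norm(W, axis=1, keepdims=True)
assert all(variance(w_pc) >= variance(w) for w in W)
assert all(recon_error(w_pc) <= recon_error(w) for w in W)
```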