Solved – How to prove PCA using induction

dimensionality-reduction, pca

In Deep Learning (Goodfellow et al.), the optimization objective of PCA is formulated as

$D^* = \arg\min_D \|X - XDD^T\|_F^2, \ \text{s.t.}\ D^T D = I$

The book gives the proof of the 1-dimension case, i.e.

$\arg\min_{d} \|X - Xdd^T\|_F^2, \ \text{s.t.}\ d^T d = 1$

is attained when $d$ is the eigenvector of $X^TX$ with the largest eigenvalue. The authors say the general case (when $D$ is an $m \times l$ matrix with $l>1$) can easily be proved by induction.
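
As a sanity check (my own NumPy sketch, not from the book), the 1-D claim holds numerically: no random unit vector $d$ gives a smaller reconstruction error than the top eigenvector of $X^TX$.

```python
import numpy as np

# Sketch (my own check, not from the book): for the 1-D case, no unit vector d
# gives a smaller reconstruction error ||X - X d d^T||_F^2 than the eigenvector
# of X^T X with the largest eigenvalue.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))

eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigh returns ascending eigenvalues
d_star = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

def recon_err(d):
    return np.linalg.norm(X - X @ np.outer(d, d), "fro") ** 2

for _ in range(2000):
    d = rng.normal(size=5)
    d /= np.linalg.norm(d)
    assert recon_err(d_star) <= recon_err(d) + 1e-9
```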

Could anyone please show me how I can prove that using induction?

I know that when $D^T D = I$:

$$
D^*
= \arg\min_D \|X - XDD^T\|_F^2
= \arg\max_D \operatorname{Tr}\left(D^T X^T X D\right)
$$

and
$$
\operatorname{Tr}\left(D^T X^T X D\right) = \left(\sum_{i=1}^{l-1} \left(d^{(i)}\right)^T X^TX\, d^{(i)}\right) + \left(d^{(l)}\right)^T X^TX\, d^{(l)},
$$

where, by the induction hypothesis, the first summand attains its maximum when $d^{(i)}$ is the eigenvector of $X^T X$ with the $i$-th largest eigenvalue. But how can I be sure that the sum as a whole is also maximized?
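
For reference, the min/max equivalence above checks out numerically; this is my own sketch, not from the book.

```python
import numpy as np

# Check (my own sketch): with D^T D = I,
# ||X - X D D^T||_F^2 = Tr(X^T X) - Tr(D^T X^T X D),
# so minimizing the reconstruction error is the same as maximizing the trace term.
rng = np.random.default_rng(1)
m, n, l = 20, 6, 3
X = rng.normal(size=(m, n))
D, _ = np.linalg.qr(rng.normal(size=(n, l)))   # random matrix with orthonormal columns

recon_err = np.linalg.norm(X - X @ D @ D.T, "fro") ** 2
trace_term = np.trace(D.T @ X.T @ X @ D)
assert np.isclose(recon_err, np.trace(X.T @ X) - trace_term)
```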

Best Answer

We will start from $$\begin{align} D^* &= \underset{D}{\arg\max}\;\operatorname{Tr}\left(D^TX^TXD\right)\\ &= \underset{D}{\arg\max}\left[\operatorname{Tr}\left(D_{l-1}^TX^TXD_{l-1}\right) + d^{(l)T}X^TXd^{(l)}\right], \end{align} $$ where $D_{l-1}$ denotes the matrix formed by the first $l-1$ columns of $D$ and $d^{(l)}$ is its last column.

The two summands involve disjoint sets of columns of $D$ (the first $l-1$ columns and the last one, respectively), so they can be maximized separately, as long as the orthonormality constraint linking them is respected.

Using the induction hypothesis, we conclude that $\operatorname{Tr}\left(D_{l-1}^TX^TXD_{l-1}\right)$ (under the constraint that the columns of $D_{l-1}$ are orthonormal) is maximized when $D_{l-1}$ consists of the orthonormal eigenvectors corresponding to the $l-1$ largest eigenvalues of $X^TX$.

Notation: Suppose $\lambda_1 \geqslant \dots \geqslant \lambda_n$ are the eigenvalues of $X^TX$ and $v_1, \dots, v_n$ are the corresponding orthonormal eigenvectors.
Denote $H_{l-1} = \operatorname{span}\{v_1, \dots, v_{l-1}\}$ and let $H_{l-1}^{\bot}$ be its orthogonal complement, i.e. $H_{l-1}^{\bot} = \operatorname{span}\{v_l, \dots, v_n\}$.

Lemma: $$\begin{align}\lambda_l &= \underset{d^{(l)}}{\max}\ d^{(l)T}X^TXd^{(l)} \quad \text{s.t. } \Vert d^{(l)}\Vert = 1,\ d^{(l)} \in H_{l-1}^\bot \\ &= v_l^TX^TXv_l. \end{align}$$

Proof: Let $\Sigma = X^TX$. Because $\Sigma$ is symmetric positive semidefinite, an eigendecomposition exists; write $\Sigma = V\Lambda V^T$, where the columns of $V$ are $v_1,\dots,v_n$ in that order, so $\Lambda=\operatorname{diag}(\lambda_1,\dots,\lambda_n)$.
$$ \begin{align} d^{(l)T}\Sigma d^{(l)} &= d^{(l)T} V\Lambda V^T d^{(l)}\\ &= q^T \Lambda\, q \qquad [\text{where } q = V^Td^{(l)}]\\ &= \sum_{i=1}^n q_i^2 \lambda_i \qquad [\text{where } q_i = (V^Td^{(l)})_i = v_i^T d^{(l)}]\\ &= \sum_{i=l}^n q_i^2 \lambda_i \qquad [\because d^{(l)} \in H_{l-1}^\bot \implies q_i = v_i^T d^{(l)} = 0\ \forall i < l] \end{align} $$ Reminder: $d^{(l)} \in H_{l-1}^\bot$ with $\Vert d^{(l)}\Vert = 1$ means $d^{(l)} = \sum_{k=l}^n \alpha_k v_k$ with $\sum_{k=l}^n \alpha_k^2 = 1$.

Now $$\begin{align} \sum_{i=l}^n q_i^2 &= \sum_{i=l}^n \Big(v_i^T \sum_{k=l}^n \alpha_k v_k\Big)^2 \\ &= \sum_{i=l}^n \big(\alpha_i v_i^T v_i\big)^2 \qquad [\because \text{the } v_i \text{ are orthonormal}] \\ &= \sum_{i=l}^n \alpha_i^2 = 1. \end{align} $$

Therefore $d^{(l)T} \Sigma d^{(l)}$ is a convex combination of $\lambda_l,\dots,\lambda_n$, which is maximized by placing all of the weight on the largest of them, $\lambda_l$, i.e. by taking $d^{(l)} = v_l$. Hence $$\underset{d^{(l)}}{\max}\ d^{(l)T}\Sigma d^{(l)} = \underset{d^{(l)}}{\max}\ d^{(l)T}X^TXd^{(l)} = v_l^TX^TXv_l = \lambda_l \quad (\text{q.e.d.})$$
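
For the skeptical reader, the lemma also checks out numerically; this is my own sketch, not part of the original answer.

```python
import numpy as np

# Sketch verifying the lemma numerically: over unit vectors in H_{l-1}^perp,
# d^T (X^T X) d never exceeds lambda_l, and equals lambda_l at d = v_l.
rng = np.random.default_rng(2)
m, n, l = 50, 6, 3
X = rng.normal(size=(m, n))
Sigma = X.T @ X

eigvals, eigvecs = np.linalg.eigh(Sigma)       # ascending order
lam = eigvals[::-1]                            # lambda_1 >= ... >= lambda_n
V = eigvecs[:, ::-1]                           # columns v_1, ..., v_n

for _ in range(5000):
    alpha = rng.normal(size=n - l + 1)
    alpha /= np.linalg.norm(alpha)
    d = V[:, l - 1:] @ alpha                   # d = sum_{k>=l} alpha_k v_k, ||d|| = 1
    assert d @ Sigma @ d <= lam[l - 1] + 1e-9  # never exceeds lambda_l

assert np.isclose(V[:, l - 1] @ Sigma @ V[:, l - 1], lam[l - 1])  # attained at v_l
```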

We conclude that $D^*$ is obtained by augmenting $D_{l-1}$ with the column $v_l$, which completes the induction step and hence the proof.
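
Finally, a quick end-to-end check (again my own sketch, not part of the original answer) that the matrix of the top-$l$ eigenvectors beats random orthonormal competitors on the original objective:

```python
import numpy as np

# Sketch: D* built from the eigenvectors of the l largest eigenvalues of X^T X
# gives a reconstruction error no larger than that of any other orthonormal D tried.
rng = np.random.default_rng(3)
m, n, l = 40, 7, 3
X = rng.normal(size=(m, n))

eigvals, eigvecs = np.linalg.eigh(X.T @ X)     # ascending order
D_star = eigvecs[:, ::-1][:, :l]               # top-l eigenvectors as columns

def recon_err(D):
    return np.linalg.norm(X - X @ D @ D.T, "fro") ** 2

for _ in range(2000):
    D_rand, _ = np.linalg.qr(rng.normal(size=(n, l)))
    assert recon_err(D_star) <= recon_err(D_rand) + 1e-9
```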
