Solved – How to prove PCA using induction

dimensionality-reduction, pca

In Deep Learning (Goodfellow et al.), the optimization objective of PCA is formulated as

$D^* = \arg\min_D \|X - XDD^T\|_F^2, \ \text{s.t.}\ D^T D = I$

The book gives the proof of the 1-dimension case, i.e.

$\arg\min_{d} \|X - Xdd^T\|_F^2, \ \text{s.t.}\ d^T d = 1$

is attained when $d$ is the eigenvector of $X^TX$ with the largest eigenvalue. The authors say the general case (when $D$ is an $m \times l$ matrix with $l>1$) can easily be proved by induction.
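
As a sanity check (my own NumPy sketch, not from the book), the 1-D claim holds numerically: no random unit vector $d$ gives a smaller reconstruction error than the top eigenvector of $X^TX$.

```python
import numpy as np

# Sketch (my own check, not from the book): for the 1-D case, no unit vector d
# gives a smaller reconstruction error ||X - X d d^T||_F^2 than the eigenvector
# of X^T X with the largest eigenvalue.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))

eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigh returns ascending eigenvalues
d_star = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

def recon_err(d):
    return np.linalg.norm(X - X @ np.outer(d, d), "fro") ** 2

for _ in range(2000):
    d = rng.normal(size=5)
    d /= np.linalg.norm(d)
    assert recon_err(d_star) <= recon_err(d) + 1e-9
```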

Could anyone please show me how I can prove that using induction?

I know that when $D^T D = I$:

$$
D^*
= \arg\min_D \|X - XDD^T\|_F^2
= \arg\max_D \operatorname{Tr}\left(D^T X^T X D\right)
$$

and
$$
\operatorname{Tr}\left(D^T X^T X D\right) = \left(\sum_{i=1}^{l-1} \left(d^{(i)}\right)^T X^TX\, d^{(i)}\right) + \left(d^{(l)}\right)^T X^TX\, d^{(l)},
$$

where, by the induction hypothesis, the first summand attains its maximum when $d^{(i)}$ is the eigenvector of $X^T X$ with the $i$-th largest eigenvalue. But how can I be sure that the sum as a whole is also maximized?
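
For reference, the min/max equivalence above checks out numerically; this is my own sketch, not from the book.

```python
import numpy as np

# Check (my own sketch): with D^T D = I,
# ||X - X D D^T||_F^2 = Tr(X^T X) - Tr(D^T X^T X D),
# so minimizing the reconstruction error is the same as maximizing the trace term.
rng = np.random.default_rng(1)
m, n, l = 20, 6, 3
X = rng.normal(size=(m, n))
D, _ = np.linalg.qr(rng.normal(size=(n, l)))   # random matrix with orthonormal columns

recon_err = np.linalg.norm(X - X @ D @ D.T, "fro") ** 2
trace_term = np.trace(D.T @ X.T @ X @ D)
assert np.isclose(recon_err, np.trace(X.T @ X) - trace_term)
```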

Best Answer

We will start from $$\begin{align} D^* &= \underset{D}{\arg\max}\;\operatorname{Tr}\left(D^TX^TXD\right)\\ &= \underset{D}{\arg\max}\left[\operatorname{Tr}\left(D_{l-1}^TX^TXD_{l-1}\right) + d^{(l)T}X^TXd^{(l)}\right], \end{align} $$ where $D_{l-1}$ denotes the matrix formed by the first $l-1$ columns of $D$ and $d^{(l)}$ is its last column.

The two summands involve disjoint sets of columns of $D$ (the first $l-1$ columns and the last one, respectively), so they can be maximized separately, as long as the orthonormality constraint linking them is respected.

Using the induction hypothesis, we conclude that $\operatorname{Tr}\left(D_{l-1}^TX^TXD_{l-1}\right)$ (under the constraint that the columns of $D_{l-1}$ are orthonormal) is maximized when $D_{l-1}$ consists of the orthonormal eigenvectors corresponding to the $l-1$ largest eigenvalues of $X^TX$.

Notation: Suppose $\lambda_1 \geqslant \dots \geqslant \lambda_n$ are the eigenvalues of $X^TX$ and $v_1, \dots, v_n$ are the corresponding orthonormal eigenvectors.
Denote $H_{l-1} = \operatorname{span}\{v_1, \dots, v_{l-1}\}$ and let $H_{l-1}^{\bot}$ be its orthogonal complement, i.e. $H_{l-1}^{\bot} = \operatorname{span}\{v_l, \dots, v_n\}$.

Lemma: $$\begin{align}\lambda_l &= \underset{d^{(l)}}{\max}\ d^{(l)T}X^TXd^{(l)} \quad \text{s.t. } \Vert d^{(l)}\Vert = 1,\ d^{(l)} \in H_{l-1}^\bot \\ &= v_l^TX^TXv_l. \end{align}$$

Proof: Let $\Sigma = X^TX$. Because $\Sigma$ is symmetric positive semidefinite, an eigendecomposition exists; write $\Sigma = V\Lambda V^T$, where the columns of $V$ are $v_1,\dots,v_n$ in that order, so $\Lambda=\operatorname{diag}(\lambda_1,\dots,\lambda_n)$.
$$ \begin{align} d^{(l)T}\Sigma d^{(l)} &= d^{(l)T} V\Lambda V^T d^{(l)}\\ &= q^T \Lambda\, q \qquad [\text{where } q = V^Td^{(l)}]\\ &= \sum_{i=1}^n q_i^2 \lambda_i \qquad [\text{where } q_i = (V^Td^{(l)})_i = v_i^T d^{(l)}]\\ &= \sum_{i=l}^n q_i^2 \lambda_i \qquad [\because d^{(l)} \in H_{l-1}^\bot \implies q_i = v_i^T d^{(l)} = 0\ \forall i < l] \end{align} $$ Reminder: $d^{(l)} \in H_{l-1}^\bot$ with $\Vert d^{(l)}\Vert = 1$ means $d^{(l)} = \sum_{k=l}^n \alpha_k v_k$ with $\sum_{k=l}^n \alpha_k^2 = 1$.

Now $$\begin{align} \sum_{i=l}^n q_i^2 &= \sum_{i=l}^n \Big(v_i^T \sum_{k=l}^n \alpha_k v_k\Big)^2 \\ &= \sum_{i=l}^n \big(\alpha_i v_i^T v_i\big)^2 \qquad [\because \text{the } v_i \text{ are orthonormal}] \\ &= \sum_{i=l}^n \alpha_i^2 = 1. \end{align} $$

Therefore $d^{(l)T} \Sigma d^{(l)}$ is a convex combination of $\lambda_l,\dots,\lambda_n$, which is maximized by placing all of the weight on the largest of them, $\lambda_l$, i.e. by taking $d^{(l)} = v_l$. Hence $$\underset{d^{(l)}}{\max}\ d^{(l)T}\Sigma d^{(l)} = \underset{d^{(l)}}{\max}\ d^{(l)T}X^TXd^{(l)} = v_l^TX^TXv_l = \lambda_l \quad (\text{q.e.d.})$$
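
For the skeptical reader, the lemma also checks out numerically; this is my own sketch, not part of the original answer.

```python
import numpy as np

# Sketch verifying the lemma numerically: over unit vectors in H_{l-1}^perp,
# d^T (X^T X) d never exceeds lambda_l, and equals lambda_l at d = v_l.
rng = np.random.default_rng(2)
m, n, l = 50, 6, 3
X = rng.normal(size=(m, n))
Sigma = X.T @ X

eigvals, eigvecs = np.linalg.eigh(Sigma)       # ascending order
lam = eigvals[::-1]                            # lambda_1 >= ... >= lambda_n
V = eigvecs[:, ::-1]                           # columns v_1, ..., v_n

for _ in range(5000):
    alpha = rng.normal(size=n - l + 1)
    alpha /= np.linalg.norm(alpha)
    d = V[:, l - 1:] @ alpha                   # d = sum_{k>=l} alpha_k v_k, ||d|| = 1
    assert d @ Sigma @ d <= lam[l - 1] + 1e-9  # never exceeds lambda_l

assert np.isclose(V[:, l - 1] @ Sigma @ V[:, l - 1], lam[l - 1])  # attained at v_l
```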

We conclude that $D^*$ is obtained by augmenting $D_{l-1}$ with the column $v_l$, which completes the induction step and hence the proof.
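
Finally, a quick end-to-end check (again my own sketch, not part of the original answer) that the matrix of the top-$l$ eigenvectors beats random orthonormal competitors on the original objective:

```python
import numpy as np

# Sketch: D* built from the eigenvectors of the l largest eigenvalues of X^T X
# gives a reconstruction error no larger than that of any other orthonormal D tried.
rng = np.random.default_rng(3)
m, n, l = 40, 7, 3
X = rng.normal(size=(m, n))

eigvals, eigvecs = np.linalg.eigh(X.T @ X)     # ascending order
D_star = eigvecs[:, ::-1][:, :l]               # top-l eigenvectors as columns

def recon_err(D):
    return np.linalg.norm(X - X @ D @ D.T, "fro") ** 2

for _ in range(2000):
    D_rand, _ = np.linalg.qr(rng.normal(size=(n, l)))
    assert recon_err(D_star) <= recon_err(D_rand) + 1e-9
```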
