What does it mean for an expression to be an orthogonal projection onto the latent space

linear algebra, machine learning

On page 576 of Bishop's PRML, it is stated that

$$
(\mathbf{W}_{ML}^T\mathbf{W}_{ML})^{-1}\mathbf{W}^T_{ML}(\mathbf{x} - \mathbf{\bar{x}})
$$

represents an orthogonal projection of the data point $\mathbf{x}$ onto the latent space.

$\mathbf{W}$ is a $D\times M$ matrix. $\mathbf{x}$ is $D\times 1$. The latent space is $M$-dimensional.

What does it mean for this expression to be an orthogonal projection onto the latent space (and how do we know that it is one), and why is it important?

Best Answer

This question arises in the context of principal component analysis. To keep our notation simple, let's assume that our dataset is centred at the origin, i.e. $\mathbf {\bar x }= 0$. The goal is to approximate a datapoint $\mathbf x \in \mathbb R^n$ as closely as possible using a point chosen from the $d$-dimensional subspace spanned by the columns of an $n \times d$ matrix $\mathbf W$. In other words, we wish to find $$ \mathbf x_\star := \mathbf W\mathbf z_\star,$$

where $$ \mathbf z_{\star}:= {\rm argmin}_{\mathbf z \in \mathbb R^d}\| \mathbf x - \mathbf W \mathbf z\|^2.$$

[This $\mathbf x_\star$ is called the "orthogonal projection" of $\mathbf x$ onto the subspace spanned by the columns of $\mathbf W$ because, if computed correctly, the residual $\mathbf x - \mathbf x_\star$ is orthogonal to that subspace (and in particular to $\mathbf x_\star$ itself). Geometrically, this is quite intuitive.]

Let's go ahead and compute this approximation. First, let's find $\mathbf z_\star$ by differentiation (assuming $\mathbf W$ has full column rank, so that $\mathbf W^T \mathbf W$ is invertible): $$ \mathbf 0 = \left( \frac{\partial}{\partial \mathbf z} \| \mathbf x - \mathbf W \mathbf z\|^2 \right)\Big\vert_{\mathbf z = \mathbf z_\star} = -2\mathbf W^T(\mathbf x - \mathbf W\mathbf z_\star) \implies \mathbf z_\star = (\mathbf W^T \mathbf W)^{-1}\mathbf W^T \mathbf x.$$
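
If it helps, here is a minimal NumPy sketch of that closed form (the names `n`, `d`, `W`, `x`, `z_star` are placeholders of my own, not Bishop's notation); it simply checks the normal-equation solution against a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 2                       # ambient and latent dimensions (placeholder sizes)
W = rng.standard_normal((n, d))   # columns span the subspace we project onto
x = rng.standard_normal(n)        # a data point, assumed already centred

# Closed-form solution of the normal equations: z_star = (W^T W)^{-1} W^T x
z_star = np.linalg.solve(W.T @ W, W.T @ x)

# This should agree with a generic least-squares solver for min_z ||x - W z||^2
z_lstsq, *_ = np.linalg.lstsq(W, x, rcond=None)
print(np.allclose(z_star, z_lstsq))  # True
```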

In machine learning, this $\mathbf z_{\star}$ is the latent vector for this datapoint, and corresponds to the expression in your question (assuming $\mathbf {\bar x} = 0$). The approximation $\mathbf x_\star$ is then given by $\mathbf x_\star = \mathbf W \mathbf z_\star$.
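
To tie this back to the expression in the question, one can keep the mean subtraction explicit. A hedged sketch under the same assumptions (the names `X`, `W`, `x_bar` are again placeholders, with `W` standing in for $\mathbf W_{ML}$):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 100, 5, 2              # number of points, data dim, latent dim (placeholders)
X = rng.standard_normal((N, D))  # rows are data points
W = rng.standard_normal((D, M))  # stands in for W_ML

x_bar = X.mean(axis=0)           # sample mean, the \bar{x} in Bishop's expression
x = X[0]                         # one data point

# Latent vector: (W^T W)^{-1} W^T (x - x_bar)
z_star = np.linalg.solve(W.T @ W, W.T @ (x - x_bar))

# Reconstruction: the orthogonal projection of (x - x_bar) onto the columns of W
x_star = W @ z_star
```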

[Just for fun, let's verify that $\mathbf x_\star$ and $\mathbf x - \mathbf x_\star$ are orthogonal, justifying the phrase "orthogonal projection". Using $\mathbf x_\star = \mathbf W(\mathbf W^T\mathbf W)^{-1}\mathbf W^T\mathbf x$ and the fact that $\mathbf W(\mathbf W^T\mathbf W)^{-1}\mathbf W^T$ is symmetric: \begin{align} \mathbf x_\star \cdot (\mathbf x - \mathbf x_\star) &= \mathbf x^T\mathbf W(\mathbf W^T\mathbf W)^{-1}\mathbf W^T\left( \mathbf x-\mathbf W(\mathbf W^T\mathbf W)^{-1}\mathbf W^T \mathbf x\right) \\ &= \mathbf x^T\mathbf W(\mathbf W^T\mathbf W)^{-1}\mathbf W^T \mathbf x- \mathbf x^T\mathbf W(\mathbf W^T\mathbf W)^{-1}\underbrace{\mathbf W^T\mathbf W(\mathbf W^T\mathbf W)^{-1}}_{=\,\mathbf I}\mathbf W^T \mathbf x \\ &= 0. \end{align} ]
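
Numerically, this orthogonality is also easy to confirm; a small sketch under the same assumptions as above (centred data, placeholder `W` and `x`):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((5, 2))
x = rng.standard_normal(5)

x_star = W @ np.linalg.solve(W.T @ W, W.T @ x)  # orthogonal projection of x
residual = x - x_star

print(np.isclose(x_star @ residual, 0.0))  # True: x_star is orthogonal to the residual
print(np.allclose(W.T @ residual, 0.0))    # True: residual is orthogonal to every column of W
```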