Proof of the variance of one-dimensional projections

machine learning, principal component analysis, projection

In Chapter 12 of Bishop's Pattern Recognition and Machine Learning, the following setup is used.

Suppose $X$ is an uncentered data matrix whose columns are the data points $x_n$, and $\bar{x}=\frac{1}{N}\sum_{n=1}^{N}x_n$ is the sample mean of those columns.

For the projection onto a one-dimensional space, take an arbitrary direction vector $u$, chosen to be a unit vector so that $u^Tu=1$.

The book states that the variance of the projected data is

$$u^TSu$$

where $S$ is the data covariance matrix defined by

$$S = \frac{1}{N}\sum_{n=1}^{N}(x_n-\bar{x})(x_n-\bar{x})^T$$

But how would I prove this?

The variance of the projected data is $\frac{1}{N}\sum_{n=1}^{N}(u^Tx_n-u^T\bar{x})^2$, but expanding this does not seem to get me to $u^TSu$.

Can anyone help me?

Best Answer

The mean of the data, $\overline{\mathbf{x}}={1\over N}\sum_{n=1}^N \mathbf{x}_n$, is a vector obtained by averaging all of the $\mathbf{x}_n$, and $\mathbf{u}_1$ is a unit vector, so $\mathbf{u}_1^T \mathbf{x}_n$ is just the dot product of $\mathbf{u}_1$ (the unit vector in the "principal component" direction) with the vector-valued $\mathbf{x}_n$.

The left-hand side of the equation $$ {1\over N}\sum_{n=1}^N\{\mathbf{u}_1^T \mathbf{x}_n-\mathbf{u}_1^T \overline{\mathbf{x}}\}^2=\mathbf{u}_1^T S \mathbf{u}_1 $$ is the usual variance of the data in the $\mathbf{u}_1$ direction: each summand is the scalar $\mathbf{u}_1\cdot (\mathbf{x}_n-\overline{\mathbf{x}})$, squared. What Bishop has done is write one factor of that square as $\mathbf{u}_1^T (\mathbf{x}_n-\overline{\mathbf{x}})$ and the other as $(\mathbf{x}_n -\overline{\mathbf{x}})^T\mathbf{u}_1$, which is allowed because the transpose of a scalar is the same scalar, and then pull the constant $\mathbf{u}_1^T$ and $\mathbf{u}_1$ factors out of the sum on the left and the right:

$$ \mathbf{u}_1^T \left( {1\over N} \sum_{n=1}^N \{\mathbf{x}_n-\overline{\mathbf{x}}\} \{\mathbf{x}_n-\overline{\mathbf{x}}\}^T \right) \mathbf{u}_1\tag{1} $$ and then call all the stuff in the middle $S$.
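Spelled out as a chain of equalities (this just restates the step above in one display):

$$ {1\over N}\sum_{n=1}^N\bigl(\mathbf{u}_1^T \mathbf{x}_n-\mathbf{u}_1^T \overline{\mathbf{x}}\bigr)^2 ={1\over N}\sum_{n=1}^N \mathbf{u}_1^T(\mathbf{x}_n-\overline{\mathbf{x}})\,(\mathbf{x}_n-\overline{\mathbf{x}})^T\mathbf{u}_1 =\mathbf{u}_1^T\left({1\over N}\sum_{n=1}^N(\mathbf{x}_n-\overline{\mathbf{x}})(\mathbf{x}_n-\overline{\mathbf{x}})^T\right)\mathbf{u}_1 =\mathbf{u}_1^T S\,\mathbf{u}_1. $$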

So there is not much to "prove"; Bishop is simply giving the matrix in the middle the name $S$.

One thing that might be confusing is the indexing: $n$ labels the data points (not the vector components), $N$ is the number of data points, and $D$ is the dimension of the vectors.

To illustrate, if we let the dimension $D$ be 2, then written out in full, (1) looks like $$ \begin{pmatrix} u_1 & u_2 \end{pmatrix} \left( {1\over N} \sum_{n=1}^N \begin{pmatrix} (x_{n1}-\overline{x}_1)(x_{n1}-\overline{x}_1) & (x_{n1}-\overline{x}_1)(x_{n2}-\overline{x}_2) \\ (x_{n2}-\overline{x}_2)(x_{n1}-\overline{x}_1) & (x_{n2}-\overline{x}_2)(x_{n2}-\overline{x}_2) \end{pmatrix} \right) \begin{pmatrix} u_1\\ u_2 \end{pmatrix} $$
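If you want to see the identity numerically as well, here is a minimal NumPy sketch (the random data, the variable names, and the choice of $\mathbf{u}$ are just illustrative, not from Bishop): it builds $S$ with the $1/N$ convention above and compares the variance of the projected data with $\mathbf{u}^T S\,\mathbf{u}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 2                          # N data points, each of dimension D
X = rng.normal(size=(N, D)) @ np.array([[2.0, 0.3],
                                        [0.3, 0.5]])   # some correlated 2-D data

u = np.array([1.0, 1.0])
u /= np.linalg.norm(u)                 # unit vector, so u^T u = 1

xbar = X.mean(axis=0)                  # sample mean
S = (X - xbar).T @ (X - xbar) / N      # covariance matrix with the 1/N convention

lhs = np.mean((X @ u - u @ xbar) ** 2) # (1/N) * sum_n (u^T x_n - u^T xbar)^2
rhs = u @ S @ u                        # u^T S u

print(lhs, rhs)                        # the two values agree up to rounding
```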
