I will answer each question:

1) The defining equation of an eigenvector, $Ax = \lambda x$, is ambidextrous. If $x$ is an eigenvector, so is $-x$, for then
$$A(-x) = -Ax = -\lambda x = \lambda (-x)$$
So an eigenbasis is only determined up to the sign of each vector.
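A quick numerical check in R makes this concrete (the matrix $A$ below is just a hypothetical example; note that eigen() itself picks one of the two signs arbitrarily):
A <- matrix(c(2, 1, 1, 2), nrow = 2)  # hypothetical symmetric matrix, eigenvalues 3 and 1
e <- eigen(A)
v <- e$vectors[, 1]                   # unit eigenvector; the sign eigen() returns is arbitrary
lam <- e$values[1]
max(abs(A %*% v - lam * v))           # ~0 (up to rounding): v satisfies Ax = lambda x
max(abs(A %*% (-v) - lam * (-v)))     # ~0: and so does -v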
2) It's hard to know for sure, but I have a strong suspicion of what is happening here. Your equation
$$ (A - \lambda)x = 0 $$
is technically incorrect. The correct equation is
$$ (A - \lambda I)x = 0 $$
The first equation is often used as a shorthand for the second. On paper this is unambiguous, because there is no real mathematical way to subtract a scalar from a square matrix, but it is an abuse of notation. In R, though, you have broadcasting. So if you do
> M <- matrix(c(1, 1, 1, 1), nrow=2)
> M - .5
[,1] [,2]
[1,] 0.5 0.5
[2,] 0.5 0.5
it's not really what you want. The proper way would be
> M - diag(.5, 2)
[,1] [,2]
[1,] 0.5 1.0
[2,] 1.0 0.5
The reason you are getting only the zero solution is that the matrix $A$ you are starting with is invertible. More than likely (almost surely), the matrix you get by subtracting the same number from every entry will also be invertible, and for an invertible matrix $M$ the only solution to $Mx = 0$ is the zero vector.
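To see this concretely, here is a small sketch contrasting the two subtractions at a true eigenvalue (the matrix $A$ is again a made-up example):
A <- matrix(c(2, 1, 1, 2), nrow = 2)  # hypothetical invertible matrix, eigenvalues 3 and 1
lam <- eigen(A)$values[1]             # a true eigenvalue of A
det(A - lam)                          # broadcast subtraction: nonzero, so only x = 0 solves (A - lam)x = 0
det(A - diag(lam, 2))                 # correct A - lambda*I: ~0, so a nontrivial null space exists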
I think there are two ellipses that we could consider. First, consider the image of the unit circle under the linear map $x \mapsto Ax$ for positive-definite $A \in \mathbb R^{n \times n}$. It is a standard result that the quadratic form $f(x) = x^T A x$ is maximized over unit vectors $x$ by the unit eigenvector $v_1$ with the largest eigenvalue $\lambda_1$, and correspondingly $\|Av_1\| = \lambda_1$ is the largest that $\|Ax\|$ gets over unit vectors. So the ellipse formed by the image of the unit circle under this map has its largest semi-axis along $v_1$ with length $\lambda_1$, and so on for the other eigenpairs. In this case we clearly have that the eigenvalues give the lengths of the semi-axes, and the biggest semi-axis goes with the biggest eigenvalue.
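Here is a quick R sketch of that first ellipse (the $2 \times 2$ matrix is illustrative): push unit vectors through $A$ and measure the image.
theta <- seq(0, 2 * pi, length.out = 1000)
circle <- rbind(cos(theta), sin(theta))  # columns are points on the unit circle
A <- matrix(c(2, 1, 1, 2), nrow = 2)     # hypothetical PD matrix, eigenvalues 3 and 1
img <- A %*% circle                      # image of the unit circle: an ellipse
max(sqrt(colSums(img^2)))                # ~3 = lambda_1, attained near v_1 = (1, 1)/sqrt(2)
min(sqrt(colSums(img^2)))                # ~1 = lambda_2, attained near v_2 = (1, -1)/sqrt(2)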
But now consider the contour $f(x) = 1$, i.e. the set of $x \in \mathbb R^n$ with $x^T A x = 1$. Since $A$ is positive-definite, the graph of $f$ is an elliptic paraboloid, so its intersection with a horizontal plane (i.e. the contour) is an ellipse. In this case we find that the shortest semi-axis is parallel to $v_1$, which makes sense because that is the direction in which $f$ grows the fastest, so we hit 1 the soonest. The largest semi-axis is along $v_n$, since that is the direction in which $f$ grows the slowest. Plugging $v_1$ into $f$ we get $f(v_1) = v_1^T A v_1 = \lambda_1 v_1^T v_1 = \lambda_1$, not 1 as required, so the actual semi-axis vector parallel to $v_1$ is $\frac{1}{\sqrt{\lambda_1}} v_1$. Does this help?
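Continuing with the same illustrative matrix, the rescaling is easy to check numerically:
A <- matrix(c(2, 1, 1, 2), nrow = 2)  # same hypothetical PD matrix as above
e <- eigen(A)
v1 <- e$vectors[, 1]; l1 <- e$values[1]
drop(t(v1) %*% A %*% v1)              # = lambda_1 (3 here), so v1 itself overshoots the contour
u <- v1 / sqrt(l1)                    # rescale to land on f(x) = 1
drop(t(u) %*% A %*% u)                # = 1: the semi-axis along v1 has length 1/sqrt(lambda_1)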
Bringing this back to PCA, let's say that our data consists of $m$ observations coming iid from $\mathcal N_2(\vec 0, \Sigma)$. Let's draw an ellipse around our data such that every point inside the ellipse has likelihood greater than some cutoff $c$. This corresponds to a contour of the likelihood and can be found by $$
c = \frac{1}{2\pi \vert \Sigma \vert^{1/2}}\exp \left( -\frac12 x^T \Sigma^{-1} x\right) \iff x^T \Sigma^{-1} x = -2\log \left(2\pi c \vert \Sigma \vert^{1/2}\right)
$$
i.e. the ellipse that circles the data is a contour of the quadratic form $g(x) = x^T \Sigma^{-1} x$. Note that $\Sigma^{-1}v = \lambda v \implies \Sigma v = \frac1\lambda v$, so the eigenvectors defining the axes of this likelihood contour ellipse are the same as those of the covariance matrix $\Sigma$, but with reciprocal eigenvalues.
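As a concrete check, here is a sketch with a made-up covariance matrix, showing the covariance matrix and the precision matrix $\Sigma^{-1}$ sharing eigenvectors with reciprocal eigenvalues:
Sigma <- matrix(c(2, 1, 1, 2), nrow = 2)  # hypothetical covariance matrix
eigen(Sigma)$values                       # 3, 1
eigen(solve(Sigma))$values                # 1, 1/3: reciprocals (eigen() sorts decreasingly, so the order flips)
eigen(Sigma)$vectors                      # same columns as eigen(solve(Sigma))$vectors, up to sign and order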