It is to be expected that "copied" blocks are almost equal (even more so after the PCA step), so in the lexicographic sort "copied" blocks should appear adjacent or nearly adjacent (note: this lexicographic order compares the most principal component first, then the next, and so on). The reverse is not true: adjacent elements in the lexicographic sort are not necessarily copies, nor even similar.
Here I made up a very simple example myself, in Octave, with a one-dimensional signal (y)
of size N=200, which has a portion of itself copied (here, from 20-50 to 150-180) and a little noise added. I take a small block size (b=3).
I convert to principal components, sort the rows in lexicographic order (first appending the original block position as an extra column), and compute the distance between adjacent rows. Notice that I'm simplifying a lot here: I'm not discarding components, nor quantizing them; and I'm considering only adjacent rows, not a neighborhood band. I then look at the histogram of those distances, and the original offset is clearly visible.
N=200;
b=3;
delay=130;
y = filter([1],[1,-0.8,0.1],rand(1,N)-0.5); % my signal, rather arbitrary
y(20+delay:50+delay) = y(20:50); % a portion is copied
y += (rand(1,N)-0.5)*0.1; % noise added
yy=[y(1:N-2);y(2:N-1);y(3:N)]; % Octave does not have corrmtx (this is hard-coded for b=3; see the general construction below)
[PC, Z, W, TSQ] = princomp (yy'); % PCA
Z(:,b+1)=[1:N-2]'; % append original block position, in extra column
Z1=sortrows(Z); % sort rows lexicographically
Z2=abs(Z1(1:N-3,b+1)-Z1(2:N-2,b+1)); % compute temporal distances between adjacent rows
hist(Z2); % histogram: should show a peak at delay
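For completeness, a minimal sketch of the same block-matrix construction for arbitrary b (only the hard-coded yy line changes; the sizes N-2 and N-3 in the rest of the script become N-b+1 and N-b):
yy = zeros(b, N-b+1); % one column per block, as in the b=3 case above
for k = 1:b
  yy(k,:) = y(k:N-b+k); % k-th row holds the signal delayed by k-1 samples
end
% the extra position column then becomes [1:N-b+1]', and the adjacent
% differences run over N-b rows instead of N-3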
The confusion is probably due to how the data matrix is organized.
In The Elements of Statistical Learning (and also in the Abdi & Williams paper you linked to), the data matrix $\mathbf X$ has data points in rows and variables in columns, so one data vector is a row. Whereas a single data vector $\mathbf x$ is usually understood to be a column vector. So one has to watch the algebra carefully: to project the data onto an axis $\mathbf v$, you write $\mathbf X \mathbf v$ for the whole data matrix, but $\mathbf v^\top \mathbf x$ for a single data vector.
Now, if $\mathbf X$ is centered and you do singular value decomposition (SVD) $$\mathbf X = \mathbf {USV}^\top,$$ then COLUMNS of $\mathbf V$ are principal axes (also called principal directions). The first column is the first axis, the second column is the second axis, etc. To project the data onto the first two principal axes, we write $\mathbf X \mathbf V_2$, where $\mathbf V_2$ are the first two columns of $\mathbf V$. The hyperplane spanned by the first two principal axes is therefore exactly the span of the first two columns of $\mathbf V$. There is no contradiction.
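A minimal Octave sketch of this convention (the data here is made up, and the names Xc, V2, scores are mine):
X = randn(100, 5);     % 100 data points in rows, 5 variables in columns
Xc = X - mean(X);      % center each column
[U, S, V] = svd(Xc);   % X = U*S*V', columns of V are the principal axes
V2 = V(:, 1:2);        % first two principal axes
scores = Xc * V2;      % projection onto the first two axes (100 x 2)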
Best Answer
PCA at its heart involves diagonalizing a matrix, which means solving for the eigenvalues and eigenvectors of that matrix (here, the covariance matrix of the data). The whole purpose of the calculation is to find a diagonal representation of your matrix (i.e. nonzero elements only along the diagonal). If you solve again, you are computing the eigendecomposition of an already diagonal covariance matrix, whose eigenvectors are just the standard basis vectors. Hence your PC vectors will be the same regardless of how many times you apply the transformation.
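A minimal Octave sketch of this idempotence (using princomp as in the script above; it assumes the sample variances of the scores are distinct, so the ordering is preserved):
X = randn(100, 4);          % made-up data
[PC1, Z1] = princomp(X);    % first pass: Z1 are the scores
[PC2, Z2] = princomp(Z1);   % second pass on the scores
% cov(Z1) is already diagonal, so PC2 is the identity up to sign flips,
% and Z2 equals Z1 up to the signs of its columns
D = abs(abs(Z2) - abs(Z1));
disp(max(D(:)));            % ~0 up to floating-point error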