Machine Learning – What Is Principal Subspace in Probabilistic PCA?

eigenvalues, latent-variable, machine-learning, pca

If $X$ is the observed data matrix and $Y$ is the latent variable, then

$$X=WY+\mu+\epsilon$$

where $\mu$ is the mean of the observed data, $\epsilon$ is the Gaussian error/noise in the data, and $W$ is called the principal subspace.

My question is: when normal PCA is used, we get a set of orthonormal eigenvectors $E$ for which the following is true:

$$Y=EX$$

But in PPCA, the columns of $W$ are neither orthonormal nor eigenvectors. So how can I get the principal components from $W$?

Following my instinct, I searched for ppca in MATLAB and came across this line in its documentation:

At convergence, the columns of W span the subspace, but they are not orthonormal. ppca obtains the orthonormal coefficients, coeff, for the components by orthogonalization of W.

I modified the ppca code a little to get $W$, ran it, and after orthogonalization I did obtain the principal components $P$ from $W$.
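
Roughly, here is what I did (a sketch with my own variable names; I am assuming W below is the fitted, non-orthonormal loading matrix that I pulled out of my modified ppca before it gets orthogonalized, and X is my n-by-p data matrix with observations in rows):

[P, ~]   = svd(W, 'econ');   % the orthogonalization step that ppca performs
coeffPCA = pca(X);           % ordinary PCA loadings for comparison
% Up to column signs, P matched the leading columns of coeffPCA.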

Why does this orthogonalization give eigenvectors along which most of the variance is seen?

I am assuming that orthogonalization gives me a set of orthogonal/orthonormal vectors that span the principal subspace, but why is this orthogonalized matrix equal to the eigenvector matrix (I know that the eigenvector matrix in PCA is also orthonormal)? Can I assume that the principal subspace is spanned by only one unique set of orthonormal vectors? In that case both results would always coincide.

Best Answer

This is an excellent question.

Probabilistic PCA (PPCA) is the following latent variable model \begin{align} \mathbf z &\sim \mathcal N(\mathbf 0, \mathbf I) \\ \mathbf x \mid \mathbf z &\sim \mathcal N(\mathbf W \mathbf z + \boldsymbol \mu, \sigma^2 \mathbf I), \end{align} where $\mathbf x\in\mathbb R^p$ is one observation and $\mathbf z\in\mathbb R^q$ is a latent variable vector; usually $q\ll p$. Note that this differs from factor analysis in only one little detail: the error covariance structure in PPCA is $\sigma^2 \mathbf I$, whereas in FA it is an arbitrary diagonal matrix $\boldsymbol \Psi$.
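
To make the generative model concrete, here is a minimal MATLAB sketch that draws samples from it (the dimensions and parameter values are arbitrary and used only for illustration):

% Draw n samples from the PPCA model: x | z ~ N(W z + mu, sigma2 * I).
p = 10; q = 2; n = 1000; sigma2 = 0.1;        % arbitrary sizes and noise level
W  = randn(p, q);                             % "true" loadings (not orthonormal)
mu = randn(p, 1);                             % "true" mean
Z  = randn(q, n);                             % latent variables, z ~ N(0, I)
X  = W*Z + repmat(mu, 1, n) + sqrt(sigma2)*randn(p, n);   % observations in columns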

Tipping & Bishop (1999), Probabilistic Principal Component Analysis, prove the following theorem: the maximum likelihood solution for PPCA can be obtained analytically and is given by (Eq. 7): $$\mathbf W_\mathrm{ML} = \mathbf U_q (\boldsymbol \Lambda_q - \sigma_\mathrm{ML}^2 \mathbf I)^{1/2} \mathbf R,$$ where $\mathbf U_q$ is a matrix of the $q$ leading principal directions (eigenvectors of the covariance matrix), $\boldsymbol \Lambda_q$ is the diagonal matrix of the corresponding eigenvalues, $\sigma_\mathrm{ML}^2$ is also given by an explicit formula (the average of the discarded eigenvalues), and $\mathbf R$ is an arbitrary $q\times q$ rotation matrix (corresponding to rotations in the latent space).
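
As a sketch (continuing with X, p, q from the snippet above, and taking $\mathbf R = \mathbf I$), the closed-form solution can be written down directly from the sample covariance matrix:

% Closed-form ML solution, with R = I (observations are the columns of X).
S = cov(X');                               % p x p sample covariance
[V, D]     = eig(S);
[lam, idx] = sort(diag(D), 'descend');     % eigenvalues, largest first
Uq      = V(:, idx(1:q));                  % q leading principal directions
sig2ML  = mean(lam(q+1:end));              % ML noise variance: mean of discarded eigenvalues
WML     = Uq * sqrt(diag(lam(1:q)) - sig2ML*eye(q));   % W_ML for R = I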

The ppca() function implements an expectation-maximization (EM) algorithm to fit the model, but we know that it must converge to $\mathbf W_\mathrm{ML}$ as given above.
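
For intuition, the EM updates can be written compactly in terms of the sample covariance (a sketch based on the covariance-form updates in Tipping & Bishop's paper; the actual ppca implementation works with the data matrix and also handles missing values, so it looks different):

% EM for PPCA in covariance form; S, p, q as above.
W = randn(p, q); sig2 = 1;                           % random initialization
for iter = 1:500
    M    = W'*W + sig2*eye(q);                       % q x q
    Wnew = S*W / (sig2*eye(q) + M\(W'*S*W));         % update for W
    sig2 = trace(S - S*W*(M\Wnew')) / p;             % update for sigma^2
    W    = Wnew;
end
% At convergence, W equals W_ML up to a rotation R of its columns.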

Your question is: how to get $\mathbf U_q$ if you know $\mathbf W_\mathrm{ML}$.

The answer is that you can simply use the singular value decomposition of $\mathbf W_\mathrm{ML}$. The formula above is already of the form "orthogonal matrix times diagonal matrix times orthogonal matrix", so it is the SVD, and as the SVD is unique (up to column signs, when the singular values are distinct), you will get $\mathbf U_q$ as the left singular vectors of $\mathbf W_\mathrm{ML}$.

That is exactly what MATLAB's ppca() function does in line 305:

% Orthogonalize W to the standard PCA subspace
[coeff,~] = svd(W,'econ');

Can I assume that the principal subspace is spanned by only one unique set of orthonormal vectors?

No! There are infinitely many orthonormal bases spanning the same principal subspace. If you apply some arbitrary orthogonalization process (e.g. Gram-Schmidt) to $\mathbf W_\mathrm{ML}$, you are not guaranteed to obtain $\mathbf U_q$. But if you use SVD or something equivalent, then it will work.
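
A minimal numerical check of this last point, reusing Uq, lam, sig2ML, and q from the sketch above: build $\mathbf W_\mathrm{ML}$ with a random rotation $\mathbf R$, then compare what QR and SVD recover.

% W_ML with an arbitrary rotation R of the latent space
[R, ~] = qr(randn(q));                                % random q x q orthogonal matrix
WML    = Uq * sqrt(diag(lam(1:q)) - sig2ML*eye(q)) * R;

[Uqr,  ~] = qr(WML, 0);        % an orthonormal basis of the principal subspace,
                               % but its columns are generally NOT the eigenvectors
[Usvd, ~] = svd(WML, 'econ');  % left singular vectors: these recover Uq
                               % (up to column signs), exactly as ppca does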