PCA Cross-Validation MATLAB – How to Determine Number of Principal Components?

cross-validationMATLABpca

I'm trying to write my own function for principal component analysis, PCA (of course there's a lot already written but I'm just interested in implementing stuff by myself). The main problem I encountered is the cross-validation step and calculating predicted sum of squares (PRESS). It doesn't matter which cross-validation I use, it's a question mainly about the theory behind, but consider leave-one-out cross-validation (LOOCV). From the theory I found out that in order to perform LOOCV you need to:

  1. delete an object
  2. scale the rest
  3. perform PCA with some number of components
  4. scale the deleted object according to parameters obtained in (2)
  5. predict the object according to the PCA model
  6. calculate PRESS for this object
  7. re-perform the same algorithm to other objects
  8. sum up all the PRESS values
  9. profit

Because I'm very new in the field, in order to be sure that I'm right, I compare the results with the output from some software I have (also in order to write some code I follow the instructions in the software). I get completely the same results calculating the residual sum of squares and $R^2$, but calculating PRESS is a problem.

Could you please tell to me if what I implement in the cross-validation step is right or not:

case 'loocv'

% # n - number of objects
% # p - number of variables
% # vComponents - the number of components used in CV
dataSets = divideData(n,n); 
         % # it is just a variable responsible for creating datasets for CV 
         % #  (for LOOCV datasets will be equal to [1, 2, 3, ... , n]);'
tempPRESS = zeros(n,vComponents);

for j = 1:n
  Xmodel1 = X; % # X - n x p original matrix
  Xmodel1(dataSets{j},:) = []; % # delete the object to be predicted
  [Xmodel1,Xmodel1shift,Xmodel1div] = skScale(Xmodel1, 'Center', vCenter, 
                                              'Scaling', vScaling); 
          % # scale the data and extract the shift and scaling factor
  Xmodel2 = X(dataSets{j},:); % # the object to be predicted
  Xmodel2 = bsxfun(@minus,Xmodel2,Xmodel1shift); % # shift and scale the object
  Xmodel2 = bsxfun(@rdivide,Xmodel2,Xmodel1div);
  [Xscores2,Xloadings2] = myNipals(Xmodel1,0.00000001,vComponents); 
          % # the way to calculate the scores and loadings
                % # Xscores2 - n x vComponents matrix
                % # Xloadings2 - vComponents x p matrix
  for i = 1:vComponents
    tempPRESS(j,i) = sum(sum((Xmodel2* ...
       (eye(p) - transpose(Xloadings2(1:i,:))*Xloadings2(1:i,:))).^2));
  end
end
PRESS = sum(tempPRESS,1);

In the software (PLS_Toolbox) it works like this:

for i = 1:vComponents
    tempPCA = eye(p) - transpose(Xloadings2(1:i,:))*Xloadings2(1:i,:);
    for kk = 1:p
        tempRepmat(:,kk) = -(1/tempPCA(kk,kk))*tempPCA(:,kk);
          % # this I do not understand
        tempRepmat(kk,kk) = -1; 
          % # here is some normalization that I do not get
    end 
    tempPRESS(j,i) = sum(sum((Xmodel2*tempRepmat).^2)); 
end

So, they do some additional normalization using this tempRepmat variable: the only reason I found was that they apply LOOCV for robust PCA. Unfortunately, support team did not want to answer my question since I have only demo version of their software.

Best Answer

What you are doing is wrong: it does not make sense to compute PRESS for PCA like that! Specifically, the problem lies in your step #5.


Naïve approach to PRESS for PCA

Let the data set consist of $n$ points in $d$-dimensional space: $\mathbf x^{(i)} \in \mathbb R^d, \, i=1 \dots n$. To compute reconstruction error for a single test data point $\mathbf x^{(i)}$, you perform PCA on the training set $\mathbf X^{(-i)}$ with this point excluded, take a certain number $k$ of principal axes as columns of $\mathbf U^{(-i)}$, and find the reconstruction error as $\left \|\mathbf x^{(i)} - \hat{\mathbf x}^{(i)}\right\|^2 = \left\|\mathbf x^{(i)} - \mathbf U^{(-i)} [\mathbf U^{(-i)}]^\top \mathbf x^{(i)}\right\|^2$. PRESS is then equal to sum over all test samples $i$, so the reasonable equation seems to be: $$\mathrm{PRESS} \overset{?}{=} \sum_{i=1}^n \left\|\mathbf x^{(i)} - \mathbf U^{(-i)} [\mathbf U^{(-i)}]^\top \mathbf x^{(i)}\right\|^2.$$

For simplicity, I am ignoring the issues of centering and scaling here.

The naïve approach is wrong

The problem above is that we use $\mathbf x^{(i)}$ to compute the prediction $ \hat{\mathbf x}^{(i)}$, and that is a Very Bad Thing.

Note the crucial difference to a regression case, where the formula for the reconstruction error is basically the same $\left\|\mathbf y^{(i)} - \hat{\mathbf y}^{(i)}\right\|^2$, but prediction $\hat{\mathbf y}^{(i)}$ is computed using the predictor variables and not using $\mathbf y^{(i)}$. This is not possible in PCA, because in PCA there are no dependent and independent variables: all variables are treated together.

In practice it means that PRESS as computed above can decrease with increasing number of components $k$ and never reach a minimum. Which would lead one to think that all $d$ components are significant. Or maybe in some cases it does reach a minimum, but still tends to overfit and overestimate the optimal dimensionality.

A correct approach

There are several possible approaches, see Bro et al. (2008) Cross-validation of component models: a critical look at current methods for an overview and comparison. One approach is to leave out one dimension of one data point at a time (i.e. $x^{(i)}_j$ instead of $\mathbf x^{(i)}$), so that the training data become a matrix with one missing value, and then to predict ("impute") this missing value with PCA. (One can of course randomly hold out some larger fraction of matrix elements, e.g. 10%). The problem is that computing PCA with missing values can be computationally quite slow (it relies on EM algorithm), but needs to be iterated many times here. Update: see http://alexhwilliams.info/itsneuronalblog/2018/02/26/crossval/ for a nice discussion and Python implementation (PCA with missing values is implemented via alternating least squares).

An approach that I found to be much more practical is to leave out one data point $\mathbf x^{(i)}$ at a time, compute PCA on the training data (exactly as above), but then to loop over dimensions of $\mathbf x^{(i)}$, leave them out one at a time and compute a reconstruction error using the rest. This can be quite confusing in the beginning and the formulas tend to become quite messy, but implementation is rather straightforward. Let me first give the (somewhat scary) formula, and then briefly explain it:

$$\mathrm{PRESS_{PCA}} = \sum_{i=1}^n \sum_{j=1}^d \left|x^{(i)}_j - \left[\mathbf U^{(-i)} \left [\mathbf U^{(-i)}_{-j}\right]^+\mathbf x^{(i)}_{-j}\right]_j\right|^2.$$

Consider the inner loop here. We left out one point $\mathbf x^{(i)}$ and computed $k$ principal components on the training data, $\mathbf U^{(-i)}$. Now we keep each value $x^{(i)}_j$ as the test and use the remaining dimensions $\mathbf x^{(i)}_{-j} \in \mathbb R^{d-1}$ to perform the prediction. The prediction $\hat{x}^{(i)}_j$ is the $j$-th coordinate of "the projection" (in the least squares sense) of $\mathbf x^{(i)}_{-j}$ onto subspace spanned by $\mathbf U^{(-i)}$. To compute it, find a point $\hat{\mathbf z}$ in the PC space $\mathbb R^k$ that is closest to $\mathbf x^{(i)}_{-j}$ by computing $\hat{\mathbf z} = \left [\mathbf U^{(-i)}_{-j}\right]^+\mathbf x^{(i)}_{-j} \in \mathbb R^k$ where $\mathbf U^{(-i)}_{-j}$ is $\mathbf U^{(-i)}$ with $j$-th row kicked out, and $[\cdot]^+$ stands for pseudoinverse. Now map $\hat{\mathbf z}$ back to the original space: $\mathbf U^{(-i)} \left [\mathbf U^{(-i)}_{-j}\right]^+\mathbf x^{(i)}_{-j}$ and take its $j$-th coordinate $[\cdot]_j$.

An approximation to the correct approach

I don't quite understand the additional normalization used in the PLS_Toolbox, but here is one approach that goes in the same direction.

There is another way to map $\mathbf x^{(i)}_{-j}$ onto the space of principal components: $\hat{\mathbf z}_\mathrm{approx} = \left [\mathbf U^{(-i)}_{-j}\right]^\top\mathbf x^{(i)}_{-j}$, i.e. simply take the transpose instead of pseudo-inverse. In other words, the dimension that is left out for testing is not counted at all, and the corresponding weights are also simply kicked out. I think this should be less accurate, but might often be acceptable. The good thing is that the resulting formula can now be vectorized as follows (I omit the computation):

$$\begin{align} \mathrm{PRESS_{PCA,\,approx}} &= \sum_{i=1}^n \sum_{j=1}^d \left|x^{(i)}_j - \left[\mathbf U^{(-i)} \left [\mathbf U^{(-i)}_{-j}\right]^\top\mathbf x^{(i)}_{-j}\right]_j\right|^2 \\ &= \sum_{i=1}^n \left\|\left(\mathbf I - \mathbf U \mathbf U^\top + \mathrm{diag}\{\mathbf U \mathbf U^\top\}\right) \mathbf x^{(i)}\right\|^2, \end{align}$$

where I wrote $\mathbf U^{(-i)}$ as $\mathbf U$ for compactness, and $\mathrm{diag}\{\cdot\}$ means setting all non-diagonal elements to zero. Note that this formula looks exactly like the first one (naive PRESS) with a small correction! Note also that this correction only depends on the diagonal of $\mathbf U \mathbf U^\top$, like in the PLS_Toolbox code. However, the formula is still different from what seems to be implemented in PLS_Toolbox, and this difference I cannot explain.

Update (Feb 2018): Above I called one procedure "correct" and another "approximate" but I am not so sure anymore that this is meaningful. Both procedures make sense and I think neither is more correct. I really like that the "approximate" procedure has a simpler formula. Also, I remember that I had some dataset where "approximate" procedure yielded results that looked more meaningful. Unfortunately, I don't remember the details anymore.


Examples

Here is how these methods compare for two well-known datasets: Iris dataset and wine dataset. Note that the naive method produces a monotonically decreasing curve, whereas other two methods yield a curve with a minimum. Note further that in the Iris case, approximate method suggests 1 PC as the optimal number but the pseudoinverse method suggests 2 PCs. (And looking at any PCA scatterplot for the Iris dataset, it does seem that both first PCs carry some signal.) And in the wine case the pseudoinverse method clearly points at 3 PCs, whereas the approximate method cannot decide between 3 and 5.

enter image description here


Matlab code to perform cross-validation and plot the results

function pca_loocv(X)

%// loop over data points 
for n=1:size(X,1)
    Xtrain = X([1:n-1 n+1:end],:);
    mu = mean(Xtrain);
    Xtrain = bsxfun(@minus, Xtrain, mu);
    [~,~,V] = svd(Xtrain, 'econ');
    Xtest = X(n,:);
    Xtest = bsxfun(@minus, Xtest, mu);

    %// loop over the number of PCs
    for j=1:min(size(V,2),25)
        P = V(:,1:j)*V(:,1:j)';        %//'
        err1 = Xtest * (eye(size(P)) - P);
        err2 = Xtest * (eye(size(P)) - P + diag(diag(P)));
        for k=1:size(Xtest,2)
            proj = Xtest(:,[1:k-1 k+1:end])*pinv(V([1:k-1 k+1:end],1:j))'*V(:,1:j)'; 
            err3(k) = Xtest(k) - proj(k);
        end

        error1(n,j) = sum(err1(:).^2);
        error2(n,j) = sum(err2(:).^2);
        error3(n,j) = sum(err3(:).^2);
    end    
end

error1 = sum(error1);
error2 = sum(error2);
error3 = sum(error3);
%// plotting code
figure
hold on
plot(error1, 'k.--')
plot(error2, 'r.-')
plot(error3, 'b.-')
legend({'Naive method', 'Approximate method', 'Pseudoinverse method'}, ...
    'Location', 'NorthWest')
legend boxoff
set(gca, 'XTick', 1:length(error1))
set(gca, 'YTick', [])
xlabel('Number of PCs')
ylabel('Cross-validation error')