Solved – PCA using princomp in MATLAB

Tags: machine learning · MATLAB · pca

I'm trying to do dimensionality reduction using MATLAB's princomp, but I'm not sure I'm doing it right.

Here is my test code; I'm not sure that I'm doing the projection right:

A = rand(4,3)
AMean = mean(A)
[n, m] = size(A)
Ac = A - repmat(AMean, [n 1])       % center the data
pc = princomp(A)
k = 2;                              % number of leading principal components to keep
A_pca = Ac * pc(1:k,:)'             % not sure I'm doing the projection right
reconstructedA = A_pca * pc(1:k,:)
error = reconstructedA - Ac         % reconstruction error

And here is my code for face recognition using the ORL dataset:

%load orl_data 400x768 double matrix (400 images 768 features)
%make labels
orl_label=[];
for i=1:40
    orl_label= [orl_label;ones(10,1)*i];
end

n=size(orl_data,1);
k = randperm(n);
s= round(0.25*n);% take 25% for train

%raw pix
%split on test and train sets
data_tr = orl_data(k(1:s),:);
label_tr = orl_label(k(1:s),:);
data_te = orl_data(k(s+1:end),:);
label_te = orl_label(k(s+1:end),:);

tic
[nn_ind, estimated_label] = EuclDistClassifier(data_tr,label_tr,data_te);
toc

rate = sum(estimated_label == label_te)/size(label_te,1)

%using pca
tic
pc= princomp(data_tr);
toc

mean_face = mean(data_tr);
pc_n=100;
f_pc= pc(1:pc_n,:)';
data_pca_tr = (data_tr - repmat(mean_face, [s,1])) * f_pc;
data_pca_te = (data_te - repmat(mean_face, [n-s,1])) * f_pc;

tic
[nn_ind, estimated_label] = EuclDistClassifier(data_pca_tr,label_tr,data_pca_te);
toc

rate = sum(estimated_label == label_te)/size(label_te,1)

If I choose enough principal components, it gives me equal recognition rates; with a small number, the rate using PCA is noticeably worse.

Here are my questions:

  1. Is the princomp function the best way to calculate the first k principal components in MATLAB?
  2. Do PCA-projected features give no extra accuracy over raw features, only a smaller feature vector size (so feature vectors are faster to compare)?
  3. How can I automatically choose the minimum k (number of principal components) that gives the same accuracy as the raw feature vector?
  4. If I have a very big set of samples, can I use only a subset of them with comparable accuracy? And can I compute PCA on one set and later "add" another set (I don't want to recompute PCA for set1+set2, but somehow incrementally add the information from set2 to the existing PCA of set1)?

I also tried a GPU version, simply using gpuArray:

%test using GPU
tic
A_cpu= rand(30000,32*24);
A= gpuArray(A_cpu);
AMean = mean(A);
[n m] = size(A)
pc= princomp(A);
k=100;
A_pca= (A - repmat(AMean,[n 1])) * pc(1:k,:)';
A_pca_cpu = gather(A_pca);
toc
clear;

tic
A= rand(30000,32*24);
AMean = mean(A);
[n m] = size(A)
pc= princomp(A);
k=100;
A_pca= (A - repmat(AMean,[n 1])) * pc(1:k,:)';
toc
clear;

It works faster, but it doesn't seem suitable for big matrices; maybe I'm wrong?

If I use a big matrix it gives me:

Error using gpuArray Out of memory on device.

Best Answer

Answers to your questions:

  1. I usually use eig(cov(data)) to get the sample eigenvectors, but it is a matter of personal taste. princomp is probably better, as it gives you the eigenvectors and the projections in one step. If you know $k$ and want exactly $k$ components, it can be easier to use eigs(cov(data),k): it saves you the hassle of computing the higher-order components, since it returns only the $k$ largest eigenvalues and eigenvectors of the covariance matrix (see the first sketch after this list).
  2. Yes, you are mostly using PCA as a dimensionality reduction step. (See the next point regarding accuracy.)
  3. Automatic determination of PCA dimensionality is a big subject in its own right. Check Minka's work. A nice and quite readable survey is given by Cangelosi and Goriely here. In short, there is no way to have the same data fidelity with your projected data as with the original feature vector, because by definition you exclude some modes of variation when $k < D$. Given that some of those modes might be just noise, your classification accuracy might even improve; nevertheless, in terms of dataset reconstruction you simply have to decide how much variation (i.e. what percentage) you want to retain, which you can easily calculate from the eigenvalue ratio $\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{D} \lambda_i}$ (also shown in the first sketch below).
  4. I haven't done that before. In general, the size of your sample set is not the problem; the dimensionality of the samples is. You can calculate the eigenvectors of 100 million 10-dimensional samples in just seconds; the covariance of a 10-element sample with 10^8 readings/dimensions each will take you a lot longer. There is work on online/sequential estimation of PCA, according to Wikipedia. Warmuth and Kuzmin's "Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension" seems to investigate just that. (Probably I should also read this paper, now that I think of it.) A simple, exact way to merge a second batch into statistics you already have is shown in the second sketch below.
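
For points 1 and 3, here is a minimal MATLAB sketch (not your exact pipeline; it assumes rows are samples, as in your ORL code, and the 95% threshold is just an example):

X = data_tr;                             % assumes data_tr as in the question (rows = samples)
Xc = X - repmat(mean(X), size(X,1), 1);  % center the data
C = cov(X);                              % D x D sample covariance

% Point 1: only the k leading components, without computing the rest.
k = 100;
[V, L] = eigs(C, k);                     % V: D x k eigenvectors, L: k x k diagonal of eigenvalues
proj = Xc * V;                           % n x k projected data

% Point 3: smallest k that retains a chosen fraction of the variance.
lambda = sort(eig(C), 'descend');        % all eigenvalues (princomp's third output, latent, gives the same)
explained = cumsum(lambda) / sum(lambda);
k_min = find(explained >= 0.95, 1)       % e.g. retain 95% of the variance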
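
For point 4, if what you want is the PCA of set1+set2 without revisiting set1's raw data, you can merge the per-batch means and scatter matrices exactly and then take the eigendecomposition of the combined covariance. This is plain batch merging of sufficient statistics, not the online algorithm from the paper above; set1 and set2 are hypothetical names:

mu1 = mean(set1);  n1 = size(set1,1);
S1 = (set1 - repmat(mu1,n1,1))' * (set1 - repmat(mu1,n1,1));   % D x D scatter of set1 (keep mu1, S1, n1 around)

mu2 = mean(set2);  n2 = size(set2,1);
S2 = (set2 - repmat(mu2,n2,1))' * (set2 - repmat(mu2,n2,1));   % scatter of the new batch

n = n1 + n2;
mu = (n1*mu1 + n2*mu2) / n;              % combined mean
d = mu1 - mu2;
S = S1 + S2 + (n1*n2/n) * (d' * d);      % combined scatter about the combined mean
C = S / (n - 1);                         % combined covariance

[V, L] = eigs(C, k);                     % k leading components of set1+set2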

For your final GPU-related question: while I don't work with GPUs extensively myself (and, moreover, I have never used MATLAB for that), I know that repeated I/O transfers to and from the GPU are a severe computational bottleneck. As a result, the speed-up you get from the massively parallel computational capabilities of a GPU is nullified by the I/O overhead if you have a small dataset. You will almost surely get more insightful answers about that point on Stack Overflow.
