I'm trying to do dimensionality reduction with MATLAB's princomp, but I'm not sure I'm doing it right.
Here is my test code; in particular, I'm not sure the projection step is correct:
A = rand(4,3)
AMean = mean(A)
[n m] = size(A)
Ac = A - repmat(AMean,[n 1]) % center the data
pc = princomp(A) % columns of pc are the principal components
k = 2; % number of leading principal components to keep
A_pca = Ac * pc(:,1:k) % project onto the first k components (columns, not rows)
reconstructedA = A_pca * pc(:,1:k)' % map back to the original space
error = reconstructedA - Ac
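As a sanity check (my addition, not part of the original post): princomp also returns the projected scores directly, and they should match the manual projection above up to numerical precision:

```matlab
% princomp centers the data internally and returns the scores (projections)
[pc2, score] = princomp(A);
Ac2 = A - repmat(mean(A), [size(A,1) 1]);
d = Ac2 * pc2(:,1:k) - score(:,1:k);
max(abs(d(:))) % ~0: manual projection Ac*pc(:,1:k) equals princomp's scores
```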
And here is my code for face recognition on the ORL dataset:
%load orl_data: a 400x768 double matrix (400 images, 768 features each)
%build labels: 40 subjects, 10 images per subject
orl_label = [];
for i = 1:40
    orl_label = [orl_label; ones(10,1)*i];
end
n=size(orl_data,1);
k = randperm(n);
s= round(0.25*n);% take 25% for train
%raw pix
%split on test and train sets
data_tr = orl_data(k(1:s),:);
label_tr = orl_label(k(1:s),:);
data_te = orl_data(k(s+1:end),:);
label_te = orl_label(k(s+1:end),:);
tic
[nn_ind, estimated_label] = EuclDistClassifier(data_tr,label_tr,data_te);
toc
rate = sum(estimated_label == label_te)/size(label_te,1)
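EuclDistClassifier is not a built-in function; for readers following along, a minimal 1-nearest-neighbour implementation consistent with how it is called here might look like this (my sketch, not the author's actual code):

```matlab
function [nn_ind, estimated_label] = EuclDistClassifier(data_tr, label_tr, data_te)
% 1-NN classifier: assign each test row the label of its nearest training row
nn_ind = zeros(size(data_te,1), 1);
for i = 1:size(data_te,1)
    % squared Euclidean distances from test sample i to all training samples
    d = sum(bsxfun(@minus, data_tr, data_te(i,:)).^2, 2);
    [~, nn_ind(i)] = min(d);
end
estimated_label = label_tr(nn_ind);
end
```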
%using pca
tic
pc= princomp(data_tr);
toc
mean_face = mean(data_tr);
pc_n=100;
f_pc = pc(:,1:pc_n); % first pc_n principal components (columns of pc)
data_pca_tr = (data_tr - repmat(mean_face, [s,1])) * f_pc;
data_pca_te = (data_te - repmat(mean_face, [n-s,1])) * f_pc;
tic
[nn_ind, estimated_label] = EuclDistClassifier(data_pca_tr,label_tr,data_pca_te);
toc
rate = sum(estimated_label == label_te)/size(label_te,1)
With enough principal components I get equal recognition rates; with a small number, the PCA-based rate is poorer.
Here are some questions:
- Is the princomp function the best way to compute the first k principal components in MATLAB?
- Do PCA-projected features give no extra accuracy over raw features, only a smaller feature vector (so feature vectors are faster to compare)?
- How can I automatically choose the minimum k (number of principal components) that gives the same accuracy as the raw feature vector?
- What if I have a very big set of samples: can I use only a subset of them with comparable accuracy? And can I compute PCA on one set and later "add" another set (I don't want to recompute PCA for set1+set2, but somehow iteratively add the information from set2 to the existing PCA from set1)?
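Regarding the choice of k, one common heuristic (my suggestion, not from the original post) is to keep the smallest k whose components explain a fixed fraction of the total variance, using princomp's third output:

```matlab
[pc, score, latent] = princomp(data_tr); % latent: variance of each component
explained = cumsum(latent) ./ sum(latent); % cumulative explained variance
k = find(explained >= 0.95, 1); % smallest k explaining 95% of the variance
```

The 95% threshold is arbitrary; sweeping it against the recognition rate on a validation split would answer the "same accuracy as raw features" criterion directly.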
I also tried a GPU version, simply using gpuArray:
%test using GPU
tic
A_cpu= rand(30000,32*24);
A= gpuArray(A_cpu);
AMean = mean(A);
[n m] = size(A);
pc= princomp(A);
k=100;
A_pca = (A - repmat(AMean,[n 1])) * pc(:,1:k);
A_pca_cpu = gather(A_pca);
toc
clear;
tic
A= rand(30000,32*24);
AMean = mean(A);
[n m] = size(A);
pc= princomp(A);
k=100;
A_pca = (A - repmat(AMean,[n 1])) * pc(:,1:k);
toc
clear;
It works faster, but it doesn't seem suitable for big matrices; maybe I'm wrong?
With a big matrix it gives me:
Error using gpuArray: Out of memory on device.
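One possible workaround for the out-of-memory error (an assumption on my part, not something from the original post): compute the PCA on the CPU, then push only one chunk of rows to the GPU at a time for the projection, so the full matrix never has to fit in device memory:

```matlab
% project in row chunks so only a slice of the data lives on the GPU at once
pc_gpu = gpuArray(pc(:,1:k)); % k leading components, computed on the CPU
A_pca = zeros(n, k);
chunk = 5000; % rows per transfer; tune to the device's memory
for i = 1:chunk:n
    idx = i:min(i+chunk-1, n);
    Ag = gpuArray(A_cpu(idx,:) - repmat(AMean, [numel(idx) 1]));
    A_pca(idx,:) = gather(Ag * pc_gpu);
end
```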
Best Answer
Answers to your questions:

princomp is fine; alternatively you could use eig(cov(data)) to get the sample eigenvectors, but it is a matter of personal taste. princomp is probably better, as it gives you the eigenvectors and the projections in one step. If you know $k$ and you want exactly $k$ components, it can be easier to use eigs(cov(data),k). That saves you the hassle of computing the higher-order components, as it returns only the $k$ largest eigenvalues and eigenvectors of the covariance matrix.

For your final GPU-related question: while I don't work with GPUs extensively myself (and moreover I have never used MATLAB for that), I know that repeated I/O transfers to the GPU are a severe computational bottleneck. The speed-up you get from the massively parallel computational capabilities of a GPU can be nullified by the I/O overhead, especially if you have a small dataset. You'll almost surely get more insightful answers on that point on Stack Overflow.
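The eigs route mentioned above might be sketched as follows (my sketch; 'lm' requests the largest-magnitude eigenvalues, and the columns of V are the corresponding eigenvectors):

```matlab
C = cov(data);                         % sample covariance matrix
[V, D] = eigs(C, k, 'lm');             % only the k largest eigenpairs
Xc = bsxfun(@minus, data, mean(data)); % center the data
scores = Xc * V;                       % projection onto the k components
```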