MATLAB: Questions about dimensionality reduction in Matlab using PCA.

classification analysis, dimensionality reduction, eeg, pca, Statistics and Machine Learning Toolbox

Hi
I am currently trying to use classification analysis on some EEG data. As such data is of very high dimensionality, I am looking at using PCA for dimensionality reduction to prevent overfitting of the classification models. My data matrix is approximately 50 rows (observations) by 38,000 columns (variables). I used the MATLAB ‘pca’ function to generate principal components from my variables. I have three questions about this.
First, as stated on the MathWorks website (https://uk.mathworks.com/help/stats/pca.html), rows of the input matrix X should correspond to observations and columns to variables, which is the case for my approach. However, the number of principal components is always equal to the number of rows (observations) minus 1 (I tried using different numbers of rows). Why is this the case? Should it be this way? To me, it would be more intuitive if the maximal number of components were equal to the number of columns (variables) minus 1.
Also, I observed that the sum of the output variable ‘explained’ is always 100, whether I have 5 or 50 principal components. Am I right to assume that this variable therefore does not refer to the proportion of the original data’s variance explained by the principal components, but rather reflects the spread of variance across the individual components? How can I find out the former? That is, how much of my data’s variance is retained by the resulting principal components? Or do principal components always capture the whole variance, no matter how few there are?
Finally, my understanding of the ‘score’ variable is that it reflects my data’s variance, meaning that its columns can be used in place of my original variables (the columns of X). Is this right? Or do I have to project my data back onto the original axes after performing PCA and keeping only a subset of the components? If so, how would that even reduce the input dimensions? I tried ‘reversing’ PCA and I received the same number of variables as before, just with different values in the matrix.
I hope these questions are reasonable, and I appreciate any help you can offer. Unfortunately, I was not able to find answers by researching the web.
Best wishes.

Best Answer

Trying to answer your questions roughly in the order you asked.
The principal components (PCs) are simply linear combinations of the original variables, projected onto a different set of mutually perpendicular axes.
MATLAB will generate at most min(n-1, p) principal components, where n is the number of observations (rows) and p is the number of variables (columns). Because pca centers the data by subtracting the column means by default, the centered matrix has rank at most n-1; with 50 observations, at most 49 components can carry nonzero variance. That is why you see (rows - 1) components rather than (columns - 1). The transformed data still describes the same observations; there is no dimensionality reduction until you choose to keep only a subset of the PCs (thereby accepting some loss of the ability to explain all of the variation).
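To see this concretely, here is a minimal sketch with small random data standing in for the EEG matrix (the 10-by-200 shape mimics your "fewer observations than variables" situation):

```matlab
% Small random stand-in for a "wide" data set: n = 10 observations, p = 200 variables
rng(0);                          % for reproducibility
X = randn(10, 200);

[coeff, score, latent] = pca(X); % pca centers X by default

size(score)                      % 10-by-9: only n-1 = 9 components, not p
size(coeff)                      % 200-by-9: one column of loadings per component
```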
The explained vector has one entry per returned component. It does tell you how much of your data’s variance each principal component captures, as a percentage of the total. If you sum up the entire vector, it will sum to 100, as you say. Suppose there are five components and
explained = [60; 20; 10; 7; 3]
It sums to 100. But if you decide to use only the first two PCs, then you will have explained 80% of the total variance, not 100% (as you seem to imply in your question). The cumulative sum, cumsum(explained), tells you how much variance any leading subset of components retains.
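In MATLAB, selecting such a subset might look like the following sketch (the 90% threshold is just an illustrative choice, not a recommendation):

```matlab
% Assumes X is your 50-by-38000 data matrix
[coeff, score, ~, ~, explained] = pca(X);

cumVar = cumsum(explained);        % cumulative percentage of variance explained
k = find(cumVar >= 90, 1);         % smallest k capturing at least 90% of the variance

Xreduced = score(:, 1:k);          % k-dimensional features for the classifier
```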
For the answer to your last question about the score variable, take a look at my answer here, which has a detailed example (using more informative variable names). In short, score contains your data expressed in the principal-component coordinate system, so its leading columns can be fed directly to your classifier; you do not need to project back onto the original axes.
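As a rough sketch of both directions (using score columns directly, and reconstructing an approximation in the original space), assuming X is your data matrix and k is a hypothetical number of retained components:

```matlab
[coeff, score, ~, ~, ~, mu] = pca(X);   % mu holds the column means removed by pca

k = 5;                                   % hypothetical choice of retained components
Xk = score(:, 1:k);                      % reduced data: use these columns as features

% Optional: map back to the original high-dimensional space.
% This yields an approximation of X, not the original values exactly.
Xapprox = Xk * coeff(:, 1:k)' + mu;
```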
If you search "cyclist" and "PCA" in this forum, you will find some other stuff from me that might be helpful.