MATLAB: Questions about dimensionality reduction in Matlab using PCA.

classification analysis, dimensionality reduction, eeg, pca, Statistics and Machine Learning Toolbox

Hi
I am currently trying to use classification analysis on some EEG data. As such data is of very high dimensionality, I am looking at using PCA for dimensionality reduction to prevent overfitting of the classification models. My data matrix is approximately 50 rows (observations) by 38,000 columns (variables). I used the MATLAB ‘pca’ function to generate principal components from my variables. I have three questions about this.
First, as stated on the MathWorks website (https://uk.mathworks.com/help/stats/pca.html), rows of the input matrix X should correspond to observations and columns to variables, which is the case for my approach. However, the number of principal components is always equal to the number of rows (observations) minus 1 (I tried using different numbers of rows). Why is this the case? Should it be this way? To me, it would be more intuitive if the maximal number of components were equal to the number of columns (variables) minus 1.
Also, I observed that the sum of the output variable ‘explained’ is always 100, whether I have 5 or 50 principal components. Am I right to assume that this variable therefore does not refer to the proportion of the original data’s variance explained by the principal components, but rather reflects the spread of variance across the individual components? How can I find out the former? That is, how much of my data’s variance is retained by the resulting principal components? Or do principal components always capture the whole variance, no matter how few there are?
Finally, my understanding of the ‘score’ variable is that it reflects my data’s variance, meaning that its columns can be used in place of my original variables (the columns of X). Is this right? Or do I have to project my data back onto the original axes after performing PCA and keeping only a subset of the components? If so, how would that even reduce the input dimensions? I tried ‘reversing’ PCA and I received the same number of variables as before, just with different values in the matrix.
I hope these questions are reasonable, and I appreciate any help you can offer. Unfortunately, I was not able to find answers by researching the web.
Best wishes.

Best Answer

Trying to answer your questions roughly in the order you asked.
The principal components (PCs) are simply linear combinations of the original variables, projected onto a different set of mutually perpendicular axes.
MATLAB will generate at most min(n-1, p) principal components, where n is the number of observations (rows) and p is the number of variables (columns). Because pca centers the data by subtracting the column means by default, the centered matrix has rank at most n-1; with 50 observations, at most 49 components can carry nonzero variance. That is why you see (rows - 1) components rather than (columns - 1). The transformed data still describes the same observations; there is no dimensionality reduction until you choose to keep only a subset of the PCs (thereby accepting some loss of the ability to explain all of the variation).
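To see this concretely, here is a minimal sketch with small random data standing in for the EEG matrix (the 10-by-200 shape mimics your "fewer observations than variables" situation):

```matlab
% Small random stand-in for a "wide" data set: n = 10 observations, p = 200 variables
rng(0);                          % for reproducibility
X = randn(10, 200);

[coeff, score, latent] = pca(X); % pca centers X by default

size(score)                      % 10-by-9: only n-1 = 9 components, not p
size(coeff)                      % 200-by-9: one column of loadings per component
```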
The explained vector has one entry per returned component. It does tell you how much of your data’s variance each principal component captures, as a percentage of the total. If you sum up the entire vector, it will sum to 100, as you say. Suppose there are five components and
explained = [60; 20; 10; 7; 3]
It sums to 100. But if you decide to use only the first two PCs, then you will have explained 80% of the total variance, not 100% (as you seem to imply in your question). The cumulative sum, cumsum(explained), tells you how much variance any leading subset of components retains.
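In MATLAB, selecting such a subset might look like the following sketch (the 90% threshold is just an illustrative choice, not a recommendation):

```matlab
% Assumes X is your 50-by-38000 data matrix
[coeff, score, ~, ~, explained] = pca(X);

cumVar = cumsum(explained);        % cumulative percentage of variance explained
k = find(cumVar >= 90, 1);         % smallest k capturing at least 90% of the variance

Xreduced = score(:, 1:k);          % k-dimensional features for the classifier
```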
For the answer to your last question about the score variable, take a look at my answer here, which has a detailed example (using more informative variable names). In short, score contains your data expressed in the principal-component coordinate system, so its leading columns can be fed directly to your classifier; you do not need to project back onto the original axes.
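As a rough sketch of both directions (using score columns directly, and reconstructing an approximation in the original space), assuming X is your data matrix and k is a hypothetical number of retained components:

```matlab
[coeff, score, ~, ~, ~, mu] = pca(X);   % mu holds the column means removed by pca

k = 5;                                   % hypothetical choice of retained components
Xk = score(:, 1:k);                      % reduced data: use these columns as features

% Optional: map back to the original high-dimensional space.
% This yields an approximation of X, not the original values exactly.
Xapprox = Xk * coeff(:, 1:k)' + mu;
```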
If you search "cyclist" and "PCA" in this forum, you will find some other stuff from me that might be helpful.