Solved – The first principal component does not separate classes, but other PCs do; how is that possible?

classification, dimensionality-reduction, machine-learning, pca

I ran PCA on 17 quantitative variables in order to obtain a smaller set of variables (principal components) to be used in supervised machine learning for classifying instances into two classes. After PCA, PC1 accounts for 31% of the variance in the data, PC2 for 17%, PC3 for 10%, PC4 for 8%, PC5 for 7%, and PC6 for 6%.

However, when I look at the mean differences in PC scores between the two classes, surprisingly, PC1 is not a good discriminator between them, while the remaining PCs are. Moreover, PC1 turns out to be irrelevant in a decision tree: after pruning, it does not appear in the tree at all, which consists only of PC2–PC6.

Is there any explanation for this phenomenon? Could something be wrong with the derived variables?

Best Answer

It can also happen if the variables are not scaled to have unit variance before doing PCA. For example, for these data (notice that the $y$ scale only goes from $-0.5$ to $1$ whereas $x$ goes from $-3$ to $3$):

[Figure: scatter plot of two classes that overlap along $x$ but are cleanly separated along $y$]

PC1 is approximately $x$ and accounts for almost all the variance, but has no discriminatory power, whereas PC2 is $y$ and discriminates perfectly between the classes.
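The situation above can be reproduced with simulated data. The sketch below (a minimal illustration, not the original poster's data; all variable names and class-separation parameters are made up) builds a two-variable dataset where $x$ has large variance but no class signal, while $y$ has small variance and separates the classes. Running PCA on the unscaled data, PC1 captures most of the variance yet shows almost no mean difference between classes, while PC2 discriminates clearly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Binary class labels (hypothetical example data)
label = rng.integers(0, 2, n)

# x: large variance, unrelated to class membership
x = rng.normal(0.0, 1.5, n)
# y: small variance, but its mean shifts by 0.7 between classes
y = 0.7 * label + rng.normal(0.0, 0.1, n)

X = np.column_stack([x, y])
Xc = X - X.mean(axis=0)          # center, but do NOT scale to unit variance

# PCA via SVD of the centered (unscaled) data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)  # fraction of variance per PC
scores = Xc @ Vt.T               # PC scores for each observation

# Mean class difference along each PC
d1 = abs(scores[label == 1, 0].mean() - scores[label == 0, 0].mean())
d2 = abs(scores[label == 1, 1].mean() - scores[label == 0, 1].mean())

print(f"PC1 explains {explained[0]:.0%} of variance, class separation {d1:.2f}")
print(f"PC2 explains {explained[1]:.0%} of variance, class separation {d2:.2f}")
```

PC1 ends up dominated by the high-variance $x$ direction and is nearly useless for classification, while PC2 recovers the class-separating $y$ direction. Standardizing the variables before PCA (e.g. dividing each column by its standard deviation) would change which directions dominate.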
