Solved – Why does PCA feature reduction make accuracy dramatically worse

classification, dimensionality-reduction, pca, scikit-learn

I'm trying to estimate how much feature reduction with PCA can help increase classification accuracy across different ML methods, using the digits dataset available in scikit-learn. To do this, I first check accuracy using all 64 available features; then I use PCA to reduce to 63 features, and accuracy drops dramatically:

### ANN

| features | accuracy |
| -------: | -------------: |
| 64 | 0.966 ± 0.008 |
| 63 | 0.132 ± 0.012 |

### SVM

| features | accuracy |
| -------: | -------------: |
| 64 | 0.96 ± 0.0 |
| 63 | 0.54 ± 0.0 |

### RandomForest

| features | accuracy |
| -------: | -------------: |
| 64 | 0.974 ± 0.008 |
| 63 | 0.12 ± 0.023 |

### DecisionTree

| features | accuracy |
| -------: | -------------: |
| 64 | 0.802 ± 0.017 |
| 63 | 0.11 ± 0.013 |

All calculations were repeated 5 times to get statistics. Before PCA (64 features), scores were quite good in all cases. Afterwards, for every tested method except SVM, accuracy was practically random (there are 10 classes). I would understand a small drop, since PCA certainly loses some information, but this is extreme. 64 features is quite a lot, so I actually expected accuracy to increase. I also tried a dataset created with make_classification using 100 features, but that didn't change much. The results above were obtained with 1000 records from the digits dataset; I tried smaller amounts of data, but the results are more or less the same.

Best Answer

I've created a notebook that almost replicates your drop in accuracy.

I think the most likely error is refitting PCA on the test set: if you fit PCA on the training set, fit the classifier on those components, and then run it on principal components computed separately from the test set, the classifier receives points in the wrong parameter space. The classifier learned coordinates defined by the training-set principal components, while a PCA refit on the test set produces a different (rotated) basis, so its coordinates are meaningless to the classifier. You should fit PCA once on the training set and only transform the test set.
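A minimal sketch of this pitfall (not the notebook linked above; SVM with default settings is used here just as an example classifier): the "wrong" version calls `fit_transform` on the test set, while the "right" version uses a `Pipeline` so PCA is fit only on the training data and merely applied to the test data.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrong: PCA is refit on the test set, so the test points are expressed
# in a different (arbitrarily rotated) coordinate system than the one
# the classifier was trained in.
pca = PCA(n_components=63)
clf = SVC()
clf.fit(pca.fit_transform(X_train), y_train)
bad_score = clf.score(pca.fit_transform(X_test), y_test)

# Right: fit PCA once on the training set, then only transform the test
# set. A Pipeline does exactly this under the hood.
pipe = make_pipeline(PCA(n_components=63), SVC())
pipe.fit(X_train, y_train)
good_score = pipe.score(X_test, y_test)

print(f"refit PCA on test set: {bad_score:.3f}")
print(f"transform test set:    {good_score:.3f}")
```

With the pipeline version, keeping 63 of 64 components loses almost no variance, so accuracy stays close to the no-PCA baseline instead of collapsing.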