Combining PCA and Random Forests for Improved Classification

Tags: classification, pca, random-forest

For a recent Kaggle competition, I (manually) defined 10 additional features for my training set, which I then used to train a random forest classifier. I ran PCA on the dataset with the new features to see how they compared to one another, and found that ~98% of the variance was carried by the first component (the first eigenvector). I then trained the classifier multiple times, adding one feature at a time, and used cross-validation and RMS error to compare the quality of the classification. The classification improved with each additional feature, and the final result (with all 10 new features) was far better than the first run with (say) 2 features.
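
To make the setup concrete, here is a minimal sketch of that workflow using scikit-learn on synthetic stand-in data (the dataset, forest settings, and accuracy scoring are illustrative assumptions, not my actual competition setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a training set with 10 engineered features.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=6, random_state=0)

# How is the variance distributed across the principal components?
pca = PCA().fit(X)
print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))

# Retrain the classifier, adding one feature at a time, and compare
# cross-validated scores (my runs used RMS error; plain accuracy is
# used here for simplicity).
for k in range(2, X.shape[1] + 1):
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    score = cross_val_score(rf, X[:, :k], y, cv=5).mean()
    print(f"first {k} features: mean CV accuracy = {score:.3f}")
```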

  • Given that PCA claimed ~98% of the variance was in the first component of my dataset, why did the quality of the classifications improve so much?

  • Would this hold true for other classifiers? RF scales across multiple cores, so it's much faster to train than (say) an SVM.

  • What if I had transformed the dataset into the "PCA" space and run the classifier on the transformed data? How would my results change? (See the sketch after this list.)
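
On the third point, here is a sketch of that experiment (same synthetic stand-in data and settings as above, all of them illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=6, random_state=0)

# PCA is a rotation (plus centering): distances are preserved, but the
# directions along which the forest's axis-aligned splits cut change.
X_pca = PCA().fit_transform(X)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
print("original space:", cross_val_score(rf, X, y, cv=5).mean())
print("PCA space:     ", cross_val_score(rf, X_pca, y, cv=5).mean())
```

Whether the rotated representation helps or hurts is data-dependent; a direct comparison like the one above is the simplest way to check.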

Best Answer

When doing predictive modeling, you are trying to explain the variation in the response, not the variation in the features. There is no reason to believe that cramming as much of the feature variation as possible into a single new feature will capture a large amount of the predictive power of the features as a whole.

This is often framed as the difference between Principal Component Regression and Partial Least Squares.
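
To make that contrast concrete, here is a small illustration on synthetic data chosen so that the high-variance direction carries no signal (the setup and numbers are assumptions for illustration, not from the question):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000
x_noise = 7.0 * rng.standard_normal(n)   # high-variance feature, unrelated to y
x_signal = rng.standard_normal(n)        # low-variance feature that drives y
X = np.column_stack([x_noise, x_signal])
y = 2.0 * x_signal + 0.1 * rng.standard_normal(n)

pca = PCA(n_components=1)
pc1 = pca.fit_transform(X)
print("variance in PC1:", pca.explained_variance_ratio_[0])  # ~0.98

# PCR: regress y on the first principal component, which is chosen
# purely by feature variance.
pcr_r2 = LinearRegression().fit(pc1, y).score(pc1, y)

# PLS: the first component is chosen to covary with the response.
pls_r2 = PLSRegression(n_components=1).fit(X, y).score(X, y)

print(f"PCR R^2 (1 component): {pcr_r2:.3f}")  # near 0
print(f"PLS R^2 (1 component): {pls_r2:.3f}")  # near 1
```

Here the first principal component carries ~98% of the feature variance but essentially none of the predictive power, so PCR with one component scores near zero while PLS, which picks its component to covary with the response, scores near one.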