Why does the runtime of a classifier (XGBoost) increase "after PCA" compared to "before PCA"?

adaboost, boosting, machine learning, pca, scikit learn

The short version:
I am comparing several classifiers on a dataset from Kaggle, and I also want to compare each classifier before and after applying PCA (from sklearn) in terms of accuracy and runtime. For some reason, the runtime of the classifiers (XGBoost and AdaBoost, to take two examples) after PCA is approximately three times the runtime before PCA. My question is: why? Am I doing something wrong, or is this a plausible result?

The long version:
My understanding of how to use PCA:

  • Have normalized, clean datasets split into training and testing sets (using train_test_split).
  • Fit PCA on X_train, transform it, and save the result to a new df.
  • Using the fitted PCA, transform (without fitting) X_test.
  • Run the classifier with the transformed X_train and X_test.

PS: I have checked that the number of dimensions decreases (from 21 to the number specified: 17 in the case of 90% of the variance). The dataset has around 130000 entries and is taken from Kaggle.
The code written to achieve this:

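from sklearn.decomposition import PCA

pca = PCA(n_components=0.9)  # keep enough components to explain 90% of the variance
X_train_Reduced = pca.fit_transform(X_train)  # fit on the training set only
X_test_Reduced = pca.transform(X_test)  # reuse the fitted projection, no refitting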
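
As a sanity check on the workflow above, the sketch below runs the same fit/transform steps on synthetic data and inspects how many components were actually kept. The data, the array shapes, and the lowercase variable names are assumptions made for illustration; they are not the Kaggle dataset from the question.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 2000 rows, 21 features with decaying variance.
X = rng.normal(size=(2000, 21)) * np.linspace(3.0, 0.5, 21)
y = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A float n_components asks PCA to keep the smallest number of components
# whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.9)
X_train_reduced = pca.fit_transform(X_train)  # fit on training data only
X_test_reduced = pca.transform(X_test)        # apply the same projection

print("Components kept:", pca.n_components_)
print("Variance explained:", pca.explained_variance_ratio_.sum())
print("Train shape:", X_train_reduced.shape, "Test shape:", X_test_reduced.shape)
```

Both reduced arrays should have the same number of columns, equal to `pca.n_components_`, which confirms that the test set went through the projection learned from the training set.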

Classifier (XGBoost) before the use of PCA:

import time
import warnings

from sklearn import metrics
from xgboost import XGBClassifier

start_timeXGBoost = time.time()
warnings.filterwarnings('ignore')
modelXGBoost = XGBClassifier(learning_rate = 0.2, n_estimators = 200, verbosity = 0, use_label_encoder = False, n_jobs = -1)
modelXGBoost.fit(X_train, y_train)
predictionsXGBoost = modelXGBoost.predict(X_test)
accuracyXGBoost = metrics.accuracy_score(y_test, predictionsXGBoost)
print("Accuracy (XGBoost): ", accuracyXGBoost)
timeXGBoost = time.time() - start_timeXGBoost
print("Time taken to achive result: %s seconds" % (timeXGBoost))

Output of code:

Accuracy (XGBoost): 0.9655066214967662
Time taken to achive result: 3.33561372756958 seconds

Classifier (XGBoost) After PCA:

start_timeXGBoost = time.time()
warnings.filterwarnings('ignore')
modelXGBoost = XGBClassifier(learning_rate = 0.2, n_estimators = 200, verbosity = 0, use_label_encoder = False,
                             n_jobs = -1)
modelXGBoost.fit(X_train_Reduced, y_train)
predictionsXGBoost = modelXGBoost.predict(X_test_Reduced)
accuracyXGBoost = metrics.accuracy_score(y_test, predictionsXGBoost)
print("Accuracy (XGBoost): ", accuracyXGBoost)
timeXGBoost = time.time() - start_timeXGBoost
print("Time taken to achive result: %s seconds" % (timeXGBoost))

Output of Code:

Accuracy (XGBoost): 0.93032029565753
Time taken to achive result: 10.376214981079102 seconds

Another example (AdaBoost)
Classifier (AdaBoost) before PCA:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

start_timeAdaBoost = time.time()
modelDecTree = DecisionTreeClassifier(random_state=0, max_depth=2)
modelAdaBoost = AdaBoostClassifier(modelDecTree, n_estimators = 1000, random_state = 0, learning_rate = 1)
modelAdaBoost.fit(X_train, y_train)
predictionsAdaBoost = modelAdaBoost.predict(X_test)
accuracyAdaBoost = metrics.accuracy_score(y_test, predictionsAdaBoost)
print("Accuracy (AdaBoost): ", accuracyAdaBoost)
timeAdaBoost = time.time() - start_timeAdaBoost
print("Time taken to achive result: %s seconds" % (timeAdaBoost))

Output of code:

Accuracy (AdaBoost): 0.9575762242069603
Time taken to achive result: 103.38761949539185 seconds

Classifier (AdaBoost) after PCA:

start_timeAdaBoost = time.time()
modelDecTree = DecisionTreeClassifier(random_state=0, max_depth=2)
modelAdaBoost = AdaBoostClassifier(modelDecTree, n_estimators = 1000, random_state = 0, learning_rate = 1)
modelAdaBoost.fit(X_train_Reduced, y_train)
predictionsAdaBoost = modelAdaBoost.predict(X_test_Reduced)
accuracyAdaBoost = metrics.accuracy_score(y_test, predictionsAdaBoost)
print("Accuracy (AdaBoost): ", accuracyAdaBoost)
timeAdaBoost = time.time() - start_timeAdaBoost
print("Time taken to achive result: %s seconds" % (timeAdaBoost))

Output of code:

Accuracy (AdaBoost): 0.9141515244841392
Time taken to achive result: 295.6763050556183 seconds

I would very much appreciate any help in the matter of understanding what I have done wrong (or right).
Thank you all in advance

Best Answer

If your original data was relatively discrete (had many "ties" for some or all of the features), then this is a likely outcome, even after reducing the number of features.

In a tree model like yours, every potential split is evaluated at each node: for each feature and each unique value of that feature, the data is split into the rows below and the rows above that value. (Some implementations shrink the space of potential splits, e.g. with histogram binning.) When your original data was somewhat discrete, this space of potential splits was much smaller than the number of entries in your dataframe. After PCA, though, you've effectively rotated the axes of your space (and dropped some), so two rows sharing the exact same value of a principal component is much less likely. The number of candidate splits per feature will therefore be very close to the total number of entries in your dataframe, and even with fewer columns this can be significantly larger than the original number of candidate splits. Naturally, it takes longer to evaluate them all.
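
The effect described above can be illustrated with a small sketch that counts distinct values per column before and after PCA. The data here is synthetic (an assumption: 10,000 rows and 21 integer-valued features with 10 levels each), and `unique_counts` is a hypothetical helper written for this example, so the numbers only show the trend and not the asker's actual dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic discrete data (assumption): each feature takes only 10 distinct values.
X = rng.integers(0, 10, size=(10_000, 21)).astype(float)

pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X)


def unique_counts(data):
    """Return the number of distinct values in each column (hypothetical helper)."""
    return [np.unique(column).size for column in data.T]


unique_before = unique_counts(X)
unique_after = unique_counts(X_reduced)

# An exact split search considers roughly one threshold per distinct value,
# so these totals approximate the number of candidate splits at the root node.
print("Columns before / after PCA:", X.shape[1], "/", X_reduced.shape[1])
print("Total distinct values before PCA:", sum(unique_before))
print("Total distinct values after PCA: ", sum(unique_after))
```

Running the same count on your own X_train and X_train_Reduced is a quick way to check whether this explanation applies to your data: if the totals after PCA are far larger than before, the longer training time is expected for an exact split search.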
