Solved – How to prevent cross validation from overfitting

Tags: cross-validation, scikit-learn

I have a simple multiclass classification problem on which Logistic Regression works well (over 65% test accuracy, which is considered good for this dataset). The dataset is small: 5 classes with slight imbalance, 18 features, and 350 rows. The train/test split is 75/25.

I then tried XGBoost and performed a grid search to find the optimal hyperparameters. The CV I chose was 5-fold StratifiedKFold.

The problem I'm having is that the test error is much higher than the training error, where the training error is measured after the grid search finishes and the model is refit on the entire training set.

This clearly shows that the model is overfitting. Most baffling of all, I know of a set of hyperparameters included in the grid that achieves a lower test error (error on the 25% unseen data) than the combination the search selects.

What am I doing wrong here? How can I prevent overfitting during cross-validation?

My entire pipeline is as follows:
Data -> Train/Test split -> Cross-validate on Train -> Retrain model on Train with best hyperparameters -> Final evaluation on Test

Maybe I'm leaking data somewhere?
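For completeness, this is roughly what runs before the snippet below (a minimal sketch; the 75/25 ratio is as described above, while stratifying the split and the random_state are assumptions here):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split
from xgboost import XGBClassifier

# X, y: 350 rows, 18 features, 5 slightly imbalanced classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)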

# Scale, reduce dimensionality with PCA, then classify with XGBoost
xgb_pipeline = Pipeline([('ss', StandardScaler()),
                         ('pca', PCA()),
                         ('xgb_clf', XGBClassifier(n_jobs=1))])

xgb_param_grid1 = {
    'pca__n_components': [18, 15, 12],
    'xgb_clf__n_estimators': [5],
    'xgb_clf__gamma': [0, 0.5, 1],
    'xgb_clf__colsample_bytree': [0.7, 1],
    'xgb_clf__max_depth': [3, 5],
    'xgb_clf__reg_alpha': [0, 0.5, 1],
    'xgb_clf__reg_lambda': [0, 0.5, 1],
}

cv = StratifiedKFold(5, shuffle=False)

xgb_grid_cv = GridSearchCV(xgb_pipeline, xgb_param_grid1, cv=cv, verbose=1, n_jobs=-1, refit=True)

xgb_grid_cv.fit(X_train, y_train)

xgb_grid_cv.score(X_train, y_train)
# Score close to 1

xgb_grid_cv.score(X_test, y_test)
# Score close to 0.4
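For what it's worth, the cross-validation score of the winning candidate (the score the grid search actually optimized, as opposed to the near-perfect training score above) can be read off the fitted search object:

# Mean CV accuracy of the best parameter combination, plus the combination itself
print(xgb_grid_cv.best_score_)
print(xgb_grid_cv.best_params_)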

Thanks in advance

Best Answer

Cross-validation over the grid of hyperparameters finds the combination with the lowest cross-validation error (the average error on the held-out folds), not the lowest error on your separate test set. Thus you will find the set of hyperparameters that overfits the least among the candidates you offered it, but you are not guaranteed not to overfit, particularly when you place limits on the hyperparameters. In this case, since you have a small dataset, I would start by including max depths lower than 3 in the grid.
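For example, a widened grid along these lines would let the search consider shallower trees, more boosting rounds, and stronger regularization (a sketch only; the values are illustrative starting points, not tuned for this data):

# Sketch: allow stumpier trees and a wider range of settings than the original grid
xgb_param_grid2 = {
    'pca__n_components': [18, 15, 12],
    'xgb_clf__n_estimators': [5, 25, 50],
    'xgb_clf__learning_rate': [0.05, 0.1, 0.3],
    'xgb_clf__max_depth': [1, 2, 3],
    'xgb_clf__min_child_weight': [1, 5, 10],
    'xgb_clf__reg_lambda': [0, 0.5, 1],
}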
