I'm creating some classifiers for a binary classification problem. I want to find out three things:
- Which algorithm I should use.
- Which set of hyperparameters I should use.
- Whether I should calibrate the probability output of the classifier or not.
I was wondering how best to do this. Basically I'm doing nested cross-validation (outer loop for the algorithm and inner loop for hyperparameters) and combining it with probability calibration (and I know I shouldn't use the same data to train the model and calibrate probabilities). Here's the code I've come up with (it uses a toy dataset):
# imports (everything the snippet below needs)
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# loading data
cancer = datasets.load_breast_cancer()
X = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
y = pd.Series(cancer['target'], name='target')  # a Series avoids sklearn's column-vector warning
df = pd.concat([X, y], axis=1)  # combined frame, kept around for inspection
# creating holdout data for final model evaluation
X, X_hold, y, y_hold = train_test_split(X, y, train_size=0.8, random_state=35)

# defining everything needed for cross-validation
kfold = KFold(3, random_state=1234, shuffle=True)
rf = RandomForestClassifier()
cart = DecisionTreeClassifier()
rf_parameters = {'n_estimators': [10, 40, 100], 'max_depth': [1, 5, 10]}
cart_parameters = {'max_depth': [1, 5, 10]}
models = {cart: cart_parameters, rf: rf_parameters}
scoring = {'AUC': 'roc_auc', 'Brier_loss': 'neg_brier_score'}
brier_scores = []
resulting_models = []
for m, p in models.items():
    for train_index, test_index in kfold.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # split the training fold again so calibration never sees the model's training data
        X_train_model, X_train_calibration, y_train_model, y_train_calibration \
            = train_test_split(X_train, y_train, test_size=0.4, random_state=1234)
        clf = GridSearchCV(estimator=m, param_grid=p, scoring=scoring,
                           refit='Brier_loss', n_jobs=8, cv=3, verbose=1)
        clf.fit(X_train_model, y_train_model)
        best_model = clf.best_estimator_
        predictions = best_model.predict_proba(X_test)[:, 1]
        # calibrate the already-fitted model on the held-out calibration split
        calibrated = CalibratedClassifierCV(best_model, cv="prefit")
        calibrated.fit(X_train_calibration, y_train_calibration)
        predictions_calibrated = calibrated.predict_proba(X_test)[:, 1]
        score = brier_score_loss(y_test, predictions)
        calibrated_score = brier_score_loss(y_test, predictions_calibrated)
        # keep whichever version scored better on this outer test fold
        if score <= calibrated_score:
            resulting_models.append(best_model)
            brier_scores.append(score)
        else:
            resulting_models.append(calibrated)
            brier_scores.append(calibrated_score)
# printing results for the decision; the outer loop runs model by model,
# so the first three scores belong to CART's folds and the last three to RF's
final_scores = list(zip(resulting_models, brier_scores))  # pairs each fold's winning model with its score
n_folds = kfold.get_n_splits()
final_scores_cart = brier_scores[:n_folds]
final_scores_rf = brier_scores[n_folds:]
print('CART:', sum(final_scores_cart) / n_folds, 'RF:', sum(final_scores_rf) / n_folds)
At the end of this bit of code, I'll be able to decide which algorithm to use (CART vs. RF). I will then remove the outer loop (over algorithms) so I can decide which set of hyperparameters to use. After that, I will remove the GridSearchCV part to decide only whether I should calibrate my probabilities or not.
After all this, I can evaluate the "true" error of my model using X_hold and y_hold. Then I'll retrain the model (and the calibration, if it helped) on the full dataset.
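For that last step I'm picturing something like this (just a sketch; the chosen configuration here is hypothetical):

# sketch of the final step, assuming (hypothetically) that the winner was a
# calibrated random forest with n_estimators=100 and max_depth=5
final_model = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, max_depth=5), cv=5)
final_model.fit(X, y)  # fit and calibrate on everything except the holdout
print('holdout Brier loss:',
      brier_score_loss(y_hold, final_model.predict_proba(X_hold)[:, 1]))

# then retrain on the full dataset (training + holdout) for the final model
X_full, y_full = pd.concat([X, X_hold]), pd.concat([y, y_hold])
final_model.fit(X_full, y_full)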
Does this make sense to you? Any suggestions on how to do this properly? I feel like I'm missing something.
Best Answer
I have this exact problem. In fact, I had another layer of complexity because I also wanted to select the best preprocessing (for example, scaling, PCA, selecting the K best features...), and I wondered whether that needed another step of CV or not. I searched a lot for answers on the internet and found none, so I'll try to explain what I have done. First of all, my thoughts:
So I have thought about how to make sense of all of it, that is, how to select the best calibrated model plus the best parameters (including the preprocessing steps and their parameters), and how to assess how good the model is on completely unseen data. And I wanted to do it with doubly nested cross-validation, because three layers is too much complexity. The preprocessing choices can be searched in the same inner grid search by making them pipeline steps, as sketched below, so they don't need a CV layer of their own.
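Here is a minimal sketch of that idea (not my actual code; the step names and grid values are placeholders). Note that a whole pipeline step can be swapped out via the grid, including turning it off with 'passthrough':

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('scale', StandardScaler()),
                 ('reduce', PCA()),
                 ('model', RandomForestClassifier())])
param_grid = {'reduce': [PCA(n_components=10), 'passthrough'],  # use PCA or skip it entirely
              'model__n_estimators': [10, 40, 100],
              'model__max_depth': [1, 5, 10]}
search = GridSearchCV(pipe, param_grid, scoring='neg_brier_score', cv=3)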
In my view, the inner cross-validation should train the model as if it were the definitive one, so the calibration should enter in the inner layer. The difficulty that then arises is that you don't have, in principle, new data on which to see how good the model is in order to select the best parameters (by the way, Brier loss is, as far as I know, not a very good metric). To deal with this, I have tried two things:
As code is more explanatory than words, I'll sketch the structure of the function that tries to do what I explained above (my real version is longer and messier):
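A simplified version (the name nested_cv_with_calibration and the defaults are placeholders; calibration is done with CalibratedClassifierCV inside the loop, seeing only the outer-training data):

from sklearn.base import clone
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import brier_score_loss

def nested_cv_with_calibration(model, param_grid, X, y,
                               outer_splits=3, inner_splits=3):
    # model can be a plain estimator or a full Pipeline, so preprocessing
    # choices are tuned in the same inner search
    outer_cv = KFold(outer_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in outer_cv.split(X):
        X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
        y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]
        # inner loop: tune hyperparameters on the outer-training data only
        search = GridSearchCV(clone(model), param_grid,
                              scoring='neg_brier_score', cv=inner_splits)
        search.fit(X_tr, y_tr)
        # refit the winning configuration with calibration via internal CV,
        # still using only the outer-training data
        calibrated = CalibratedClassifierCV(
            clone(model).set_params(**search.best_params_), cv=inner_splits)
        calibrated.fit(X_tr, y_tr)
        # score the calibrated model on the untouched outer test fold
        scores.append(brier_score_loss(
            y_te, calibrated.predict_proba(X_te)[:, 1]))
    return scores

The key point is that the object scored on the outer test fold is the calibrated model, i.e., the same kind of object you would actually deploy at the end.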
I always try to write nice code and end up with something very messy... I'm sorry. Feel free to ask about anything that is not clear.