Solved – Is hyperparameter tuning on the whole data set reasonable?

hyperparameter, machine learning, python, random forest

This may be a weird question, because I don't fully understand hyperparameter tuning yet.

Currently I'm using sklearn's GridSearchCV to tune the parameters of a RandomForestClassifier, like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# 'scoring' is a dict of scorers defined elsewhere; refit='Accuracy'
# requires that it contains an entry named 'Accuracy'.
gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                  param_grid={'max_depth': range(5, 25, 4),
                              'min_samples_leaf': range(5, 40, 5),
                              'criterion': ['entropy', 'gini']},
                  scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)
results = gs.cv_results_

After that I check the gs object for best_params_ and best_score_. Then I use best_params_ to instantiate a RandomForestClassifier and run stratified cross-validation again to record metrics and save a confusion matrix:

# Implied imports/setup not shown in the snippet: 'score' is
# precision_recall_fscore_support, and skf is the stratified 3-fold
# splitter (matching GridSearchCV's internal cv=3 for classifiers).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix,
                             precision_recall_fscore_support as score)
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3)

rf = RandomForestClassifier(n_estimators=1000, min_samples_leaf=7, max_depth=18, criterion='entropy', random_state=42)
metrics = {'accuracy': [], 'precision': [], 'recall': [], 'fscore': [], 'support': []}
counter = 0

print('################################################### RandomForest ###################################################')
for train_index, test_index in skf.split(X_Distances,Y):
    X_train, X_test = X_Distances[train_index], X_Distances[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    precision, recall, fscore, support = np.round(score(y_test, y_pred), 2)
    metrics['accuracy'].append(round(accuracy_score(y_test, y_pred), 2))
    metrics['precision'].append(precision)
    metrics['recall'].append(recall)
    metrics['fscore'].append(fscore)
    metrics['support'].append(support)

    print(classification_report(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    methods.saveConfusionMatrix(matrix, ('confusion_matrix_randomforest_distances_' + str(counter) +'.png'))
    counter += 1

meanAcc = round(np.mean(metrics['accuracy']), 2) * 100
print('meanAcc: ', meanAcc)
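
As an aside (a sketch, not part of the original code): best_params_ is a plain dict, so the tuned values can be unpacked straight into the constructor instead of being copied over by hand:

# Hypothetical shorthand, assuming gs has already been fitted as above:
rf = RandomForestClassifier(n_estimators=1000, random_state=42, **gs.best_params_)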

Is this a reasonable approach, or have I got something completely wrong?

EDIT:

I just tested the following:

gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                  param_grid={'max_depth': range(5, 25, 4),
                              'min_samples_leaf': range(5, 40, 5),
                              'criterion': ['entropy', 'gini']},
                  scoring=scoring, cv=3, refit='Accuracy', n_jobs=-1)
gs.fit(X_Distances, Y)

This yields best_score_ = 0.5362903225806451 at best_index_ = 28. When I check the accuracies of the 3 folds at index 28, I get:

  1. split0: 0.5185929648241207
  2. split1: 0.526686807653575
  3. split2: 0.5637651821862348

These give the reported mean test accuracy of 0.5362903225806451. best_params_: {'criterion': 'entropy', 'max_depth': 21, 'min_samples_leaf': 5}
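
These per-fold numbers can be read straight out of cv_results_. A minimal sketch, assuming scoring is a dict containing an 'Accuracy' scorer (which refit='Accuracy' implies), so the result keys are named accordingly:

# Pull the per-fold accuracies of the best candidate out of cv_results_.
i = gs.best_index_
folds = [gs.cv_results_['split%d_test_Accuracy' % k][i] for k in range(3)]
print(folds)                                    # the three fold accuracies above
print(gs.cv_results_['mean_test_Accuracy'][i])  # the reported mean test accuracy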

Now I run the following code, which uses those best_params_ with a stratified 3-fold split (like GridSearchCV):

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, max_depth=21, criterion='entropy', random_state=42)
metrics = {'accuracy':[], 'precision':[], 'recall':[], 'fscore':[], 'support':[]}
counter = 0
print('################################################### RandomForest ###################################################')
for train_index, test_index in skf.split(X_Distances,Y):
    X_train, X_test = X_Distances[train_index], X_Distances[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    precision, recall, fscore, support = np.round(score(y_test, y_pred), 2)
    metrics['accuracy'].append(accuracy_score(y_test, y_pred))
    metrics['precision'].append(precision)
    metrics['recall'].append(recall)
    metrics['fscore'].append(fscore)
    metrics['support'].append(support)

    print(classification_report(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    methods.saveConfusionMatrix(matrix, ('confusion_matrix_randomforest_distances_' + str(counter) +'.png'))
    counter += 1

meanAcc = np.mean(metrics['accuracy'])
print('meanAcc: ', meanAcc)

The metrics dictionary yields the exact same accuracies (split0: 0.5185929648241207, split1: 0.526686807653575, split2: 0.5637651821862348).

However, the mean calculation is slightly off: 0.5363483182213101. With this approach I get the actual predictions of the best_estimator_ found by GridSearchCV, so I can plot a confusion matrix for each fold and analyse it. The production model would then be trained on my whole data set.
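
The small difference in the mean is most likely fold weighting: older versions of GridSearchCV averaged the fold scores weighted by the size of each test fold (the iid=True default of that era), whereas np.mean weights all folds equally. A minimal sketch of the two averages (the fold sizes n0, n1, n2 are placeholders, since stratified folds can differ by a few samples):

import numpy as np

fold_acc = [0.5185929648241207, 0.526686807653575, 0.5637651821862348]
print(np.mean(fold_acc))  # 0.53634831822131..., the unweighted mean computed above
# Weighted variant, as old GridSearchCV reported it (n0, n1, n2 = test-fold sizes):
# print(np.average(fold_acc, weights=[n0, n1, n2]))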

Best Answer

GridSearchCV uses cross-validation internally, so if you take the best parameters you should be able to reproduce the best result. Just be careful to set your test data aside and use it only at the very end.

Holding out 20-30 % of the data as a test set is the usual practice.
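
A minimal sketch of that workflow, reusing the names from the question (the 25 % split is just one choice in that range, and the custom scoring dict is dropped for brevity):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out 25 % of the data up front; the tuning never sees it.
X_train, X_test, y_train, y_test = train_test_split(
    X_Distances, Y, test_size=0.25, stratify=Y, random_state=42)

gs = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=42),
                  param_grid={'max_depth': range(5, 25, 4),
                              'min_samples_leaf': range(5, 40, 5),
                              'criterion': ['entropy', 'gini']},
                  cv=3, n_jobs=-1)
gs.fit(X_train, y_train)          # cross-validated tuning on the training part only

print(gs.best_params_)
print(gs.score(X_test, y_test))   # held-out test set, used exactly once at the end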
