Solved – Nested cross validation using random forests

Tags: cross-validation, hyperparameter, random forest

I would like to tune the hyperparameters of a random forest and then obtain an unbiased score of its performance.

As I understand, the natural way would be to use nested cross validation. However, it seems to me that with this method the best hyperparameters found in each inner loop might differ across loops, which then creates a problem when you want to report the hyperparameter settings for which the average score was obtained. Ideally, I would like to first settle on a hyperparameter setting, then get an error measure specific to that setting.

I came up with the following procedure:

  1. Separate the dataset into a validation and a train+test set.
  2. Perform a grid search using cross-validation on the validation set to find the optimal hyperparameters.
  3. Fit the random forest with the optimal hyperparameters on the train+test set, and report the out-of-bag error.

Point 3 rests on the fact that cross-validation is essentially not needed for random forests, since the out-of-bag error is an unbiased estimate of the generalization error.
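As a rough illustration, here is a minimal R sketch of the procedure above, using the randomForest package. The data frame, the mtry grid, the 30% split, and the 5-fold inner CV are all placeholders I chose for the example, not details from the question.

```r
# Sketch of the questioner's procedure, assuming a binary classification task.
# `df`, the mtry grid, and the split/fold sizes are illustrative assumptions.
library(randomForest)

set.seed(1)
df <- data.frame(y = factor(rbinom(500, 1, 0.5)),
                 matrix(rnorm(500 * 10), ncol = 10))

# Step 1: split into a validation set and a train+test set
val_idx    <- sample(nrow(df), size = 0.3 * nrow(df))
validation <- df[val_idx, ]
train_test <- df[-val_idx, ]

# Step 2: grid search over mtry with 5-fold CV on the validation set
mtry_grid <- c(2, 4, 6, 8)
folds <- sample(rep(1:5, length.out = nrow(validation)))
cv_error <- sapply(mtry_grid, function(m) {
  mean(sapply(1:5, function(k) {
    fit  <- randomForest(y ~ ., data = validation[folds != k, ], mtry = m)
    pred <- predict(fit, newdata = validation[folds == k, ])
    mean(pred != validation$y[folds == k])
  }))
})
best_mtry <- mtry_grid[which.min(cv_error)]

# Step 3: fit on the train+test set and report the out-of-bag error
final_fit <- randomForest(y ~ ., data = train_test, mtry = best_mtry)
oob_error <- final_fit$err.rate[final_fit$ntree, "OOB"]
oob_error
```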

I would like to know whether this is a sound approach, or whether there is a more natural way of doing this.

Best Answer

I think you got it almost right, but not exactly. Here is my suggestion:

  1. Separate the dataset into a test and a train+validation set.
  2. Perform a grid search using cross-validation on the train+validation set, i.e. fit on the train folds and evaluate the hyperparameters on the validation folds (for a random forest, this would mean tuning your mtry).
  3. Fit the model on the entire train+validation set with the optimal hyperparameters and report the error on the test set (see the sketch after this list).
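A minimal R sketch of this scheme, again with placeholder data, mtry grid, and fold counts. The key difference from the question's procedure is that the reported error comes from a test set the tuning never touched.

```r
# Sketch of the suggested split; `df`, the mtry grid, the 30% test fraction,
# and the 5-fold inner CV are illustrative assumptions.
library(randomForest)

set.seed(2)
df <- data.frame(y = factor(rbinom(500, 1, 0.5)),
                 matrix(rnorm(500 * 10), ncol = 10))

# Step 1: hold out a test set; the rest is train+validation
test_idx  <- sample(nrow(df), size = 0.3 * nrow(df))
test_set  <- df[test_idx, ]
train_val <- df[-test_idx, ]

# Step 2: tune mtry with 5-fold CV on the train+validation part only
mtry_grid <- c(2, 4, 6, 8)
folds <- sample(rep(1:5, length.out = nrow(train_val)))
cv_error <- sapply(mtry_grid, function(m) {
  mean(sapply(1:5, function(k) {
    fit  <- randomForest(y ~ ., data = train_val[folds != k, ], mtry = m)
    pred <- predict(fit, newdata = train_val[folds == k, ])
    mean(pred != train_val$y[folds == k])
  }))
})
best_mtry <- mtry_grid[which.min(cv_error)]

# Step 3: refit on all of train+validation, then score on the untouched test set
final_fit  <- randomForest(y ~ ., data = train_val, mtry = best_mtry)
test_error <- mean(predict(final_fit, newdata = test_set) != test_set$y)
test_error
```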

Only this way do you ensure that the performance is measured on a part of the data the model has never seen. I recommend repeating the split in step 1 several times, for example as 10-fold CV, to make the performance estimate less prone to variance.
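A sketch of that repetition as a full nested cross-validation, with the same assumed data, grid, and inner 5-fold loop as above. Each outer fold plays the role of the test set once, and the reported score is the average over the outer folds.

```r
# Nested CV sketch: 10-fold outer loop for performance estimation,
# 5-fold inner loop for tuning mtry. Data and grid are assumptions.
library(randomForest)

set.seed(3)
df <- data.frame(y = factor(rbinom(500, 1, 0.5)),
                 matrix(rnorm(500 * 10), ncol = 10))
mtry_grid  <- c(2, 4, 6, 8)
outer_fold <- sample(rep(1:10, length.out = nrow(df)))

outer_error <- sapply(1:10, function(o) {
  train_val <- df[outer_fold != o, ]
  test_set  <- df[outer_fold == o, ]

  # Inner loop: tune mtry on the train+validation part only
  inner_fold <- sample(rep(1:5, length.out = nrow(train_val)))
  cv_error <- sapply(mtry_grid, function(m) {
    mean(sapply(1:5, function(k) {
      fit <- randomForest(y ~ ., data = train_val[inner_fold != k, ], mtry = m)
      mean(predict(fit, train_val[inner_fold == k, ]) != train_val$y[inner_fold == k])
    }))
  })
  best_mtry <- mtry_grid[which.min(cv_error)]

  # Outer loop: refit with the chosen mtry and score on the held-out fold
  fit <- randomForest(y ~ ., data = train_val, mtry = best_mtry)
  mean(predict(fit, test_set) != test_set$y)
})

mean(outer_error)  # performance estimate averaged over the 10 outer folds
```

Note that each outer fold may select a different mtry; the averaged score then estimates the performance of the whole tuning procedure rather than of one fixed setting, which is exactly the concern raised in the question.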

The mlr package has a good explanation of nested cross-validation: https://mlr-org.github.io/mlr-tutorial/devel/html/nested_resampling/index.html
