Solved – How to split dataset for model selection and tuning

cross-validation, hyperparameter, model selection

I have read as many questions as I could on model selection, cross-validation, and hyperparameter tuning, and I am still confused about how to partition a dataset for the full training/tuning process.

The scenario: I have 100,000 training instances and I need to pick between 3 competing models (say random forest, ridge, and SVR). I also need to tune the hyperparameters of the selected model. Here is how I think the process should look.

Step 1: Split the data into 80,000 training and 20,000 test sets.

Step 2: Using cross-validation, train and evaluate the performance of each model on the 80,000 training set (e.g. with 10-fold CV I would be training on 72,000 and testing on 8,000, ten times).

Step 3: Use the 20,000 test set to see how well the models generalize to unseen data, and pick a winner (say ridge).

Step 4: Go back to the 80,000 training data and use cross validation to re-train the model and tune the ridge alpha level.

Step 5: Test the tuned model on the 20,000 test set.

Step 6: Train tuned model on full dataset before putting into production.
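
In code, Steps 1 and 2 as I imagine them would look roughly like this (a scikit-learn sketch; X and y are placeholders for my 100,000 instances, and the scoring choice is just an example):

```python
# Sketch of Steps 1-2; X, y are placeholders for the 100,000 instances
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVR

# Step 1: 80,000 training / 20,000 test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 2: 10-fold CV on the 80,000 training instances for each candidate model
models = {"rf": RandomForestRegressor(), "ridge": Ridge(), "svr": SVR()}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=10, scoring="neg_mean_squared_error")
    print(name, scores.mean())
```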

Is this approach generally correct? I know that this example skimps on technical details, but I am wondering specifically about the partitioning of the dataset for selecting and tuning.

If this is not correct, please provide the steps and numeric splits that you would use in this scenario.

Best Answer

I have also been crawling through the threads on this topic.

Step 1: Split the data into 80,000 training and 20,000 test sets.

Step 2: Using cross-validation, train and evaluate the performance of each model on the 80,000 training set (e.g. with 10-fold CV I would be training on 72,000 and testing on 8,000, ten times).

Ok up to this point!

Step 3: Use the 20,000 test set to see how well the models generalize to unseen data, and pick a winner (say ridge).

Either do this on a portion of the training set that was not used to tune parameters, or implement nested cross validation in your training set (e.g., use 3/4 of each fold to train and 1/4 to select among RF, logistic regression, etc).
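
For instance, the first option might look like this (a rough sketch, assuming scikit-learn; X_train/y_train are the 80,000 from Step 1, and tuned_models is a hypothetical dict of already-tuned estimators, not something from your post):

```python
# Sketch: pick the winner on a selection slice that was never used for tuning.
# tuned_models is a hypothetical dict like {"rf": ..., "ridge": ..., "svr": ...}
# of estimators whose hyperparameters were already chosen without using X_sel.
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_fit, X_sel, y_fit, y_sel = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

sel_errors = {name: mean_squared_error(y_sel, m.fit(X_fit, y_fit).predict(X_sel))
              for name, m in tuned_models.items()}
winner = min(sel_errors, key=sel_errors.get)   # smallest selection error wins
```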

Step 4: Go back to the 80,000 training data and use cross validation to re-train the model and tune the ridge alpha level.

Step 5: Test the tuned model on the 20,000 test set.

This would not be a valid estimate of the error, as you have already used this data to choose among the three candidates (RF, LR, etc.).

Step 6: Train tuned model on full dataset before putting into production.

Tuning the model should be considered a step in the training process.
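
One way to make that concrete (a sketch, assuming scikit-learn; X_train/y_train are the 80,000 training instances and the alpha grid is just an example): wrap the tuning in GridSearchCV, so the estimator you train already contains its own tuning step.

```python
# Sketch: GridSearchCV folds hyperparameter tuning into fit(), so tuning is
# literally a step inside training a single estimator object.
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

tuned_ridge = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5)
tuned_ridge.fit(X_train, y_train)   # the inner CV over alpha happens inside this call
```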


Say you have 2 model families: RF with param NE = 100 or 200, and LR with param C = 0.1 or 0.2.

You have 2 options (you can mix and match them as long as you adhere to the basic principle: if you use data to make a decision, don't use that same data to evaluate):

A

  • Step 1. Split all data into train_validate and test. Put test in a vault.
  • Step 2. Split train_validate into train and validate.
  • Step 3. Train 2 RF on train with param NE = 100 and 200. Train 2 LR on train with param C = 0.1 and 0.2. Try all four models on validate. Choose the model model_se with the smallest validation error. This is your "modeling process".
  • Step 4. Unlock the vault and test model_se (as is) on test to get some error. This error (one number) will be the expected error on unseen data.

(It appears you have many observations. There is no hard rule for this that I know of, but if your classes are balanced, A might be the most reasonable choice.)
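
A minimal sketch of A (assuming scikit-learn, accuracy as the metric, and reading NE as n_estimators; X and y are placeholders for all of the data):

```python
# Sketch of option A; X, y are placeholders for the full dataset
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: put test in a vault
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 2: split train_validate into train and validate
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv, test_size=0.25, random_state=0)

# Step 3: fit all four candidates on train, keep the one that does best on validate
candidates = ([RandomForestClassifier(n_estimators=ne) for ne in (100, 200)]
              + [LogisticRegression(C=c, max_iter=1000) for c in (0.1, 0.2)])
model_se = max(candidates,
               key=lambda m: accuracy_score(y_va, m.fit(X_tr, y_tr).predict(X_va)))

# Step 4: unlock the vault; this one number estimates performance on unseen data
test_score = accuracy_score(y_test, model_se.predict(X_test))
```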

B

  • Convert step 1 into an (outer) loop. If you use 7 folds you will have 7 train_validates and 7 tests.

  • Convert steps 2 and 3 into an (inner) loop. If you use 5 folds, you will 5 times create a train set on which you train the 4 models, and 5 times see which is best on validate. Take the model model_ba with the best average performance over the inner folds.

  • Test model_ba on the test set (in the outer fold) each time (each one will be a different model). Since each outer loop gives you an estimate of error, you end up with 7 error estimates. The average of these errors is E and their variance is V.

  • Rerun the modeling process (steps 2 and 3) from scratch on the entire dataset, i.e., take 100% of the data and run the inner loop on it (using the same train:validate split ratio or the same 5-fold CV that you used there). This will return some model M. You can expect performance E from model M on unseen data. The variance V unfortunately cannot be used to construct a 95% confidence interval (Bengio & Grandvalet, 2004).

B is also known as 'nested cross-validation', but it is really just plain cross-validation of an entire modeling process (one that involves tuning both parameters and hyperparameters, or treating the hyperparameters as parameters and just tuning parameters). If you choose B, it is worth running multiple iterations of it to see the variance of the entire process.
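
A minimal sketch of B with explicit loops (same assumptions and candidates as the sketch of A above; X and y are placeholder NumPy arrays):

```python
# Sketch of option B (nested CV); X, y are placeholder NumPy arrays
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def candidates():
    return ([RandomForestClassifier(n_estimators=ne) for ne in (100, 200)]
            + [LogisticRegression(C=c, max_iter=1000) for c in (0.1, 0.2)])

def modeling_process(X_tv, y_tv, n_inner=5):
    """Inner loop: score each candidate with 5-fold CV, refit the best on all of X_tv."""
    inner = KFold(n_splits=n_inner, shuffle=True, random_state=0)
    mean_scores = []
    for idx in range(len(candidates())):
        fold_scores = []
        for tr, va in inner.split(X_tv):
            model = candidates()[idx]                      # fresh copy of candidate idx
            model.fit(X_tv[tr], y_tv[tr])
            fold_scores.append(accuracy_score(y_tv[va], model.predict(X_tv[va])))
        mean_scores.append(np.mean(fold_scores))
    best = candidates()[int(np.argmax(mean_scores))]
    return best.fit(X_tv, y_tv)                            # model_ba for this outer fold

# Outer loop: 7 folds -> 7 performance estimates; their mean is E, their variance is V
outer = KFold(n_splits=7, shuffle=True, random_state=0)
outer_scores = [accuracy_score(y[te], modeling_process(X[tr], y[tr]).predict(X[te]))
                for tr, te in outer.split(X)]
E, V = np.mean(outer_scores), np.var(outer_scores)

# Finally, rerun the modeling process on 100% of the data to get the model M
M = modeling_process(X, y)   # expect roughly performance E from M on unseen data
```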

Other methods, such as the bootstrap, may be preferable to cross-validation; I have not had time to work out the details of why this is the case.
