Cross-validation over a grid of hyperparameters will find the combination of hyperparameters with the lowest average validation error. Thus you will find the candidate least prone to overfitting, but you are not guaranteed not to overfit, particularly when you place limits on the hyperparameter ranges. In this case, since you have a small dataset, I would start by including max depths lower than 3 in the grid.
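As a minimal sketch of what I mean, here is a grid search with cross-validation whose depth grid includes values below 3. The dataset and decision-tree model are stand-ins, not the asker's actual setup:

```python
# Grid-search CV over tree depth, with shallow depths (< 3) in the grid
# as suggested for a small dataset. Toy data stands in for the real problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {"max_depth": [1, 2, 3, 4, 5]}  # note depths 1 and 2 included
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)   # hyperparameter combination with best CV score
print(search.best_score_)    # its mean cross-validation accuracy
```

`best_score_` is the mean validation accuracy across the five folds; it is the quantity being maximized, not a guarantee against overfitting on new data.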
My demonstration code is here. Check it out, then come back here and read/leave comments. If you want to edit my code to make it run on a wider variety of machines that would be wonderful; just submit a pull request and/or fork the code yourself and let me know as a courtesy.
Cross-validation approximates what Bayesian optimization already does, so it is not necessary to use it with Optuna. However, if it makes sense for your problem and you have the time, combining them simply amounts to meta-optimization.
This occurred to me yesterday when I was setting up a parameter study and realized it would take months, not just days, to complete. Here is how I had initially set it up:
1. Combine the MNIST training and test sets into a single data set, then train and validate $n$ times (for combined MNIST, $n = 70,000$), holding $n_v = 65,697$ randomly chosen samples out for validation during each training epoch in each cross-validation replicate.
2. Train each replicate's model using the $n_c = 4,303$ samples not held out for validation (with loss at the end of each epoch computed on the $n_v$ held-out samples) until validation loss ceases to decrease for a specified number of epochs (via Keras' `EarlyStopping` callback with its `patience` parameter). Report the model's performance to Optuna as the average validation loss of the best trained model over all replicates. See below for how $n_c = n - n_v$ (the training-set size) was chosen.
3. Combine steps 1 and 2 into an objective function for Optuna to minimize average validation loss. Perform the study, being sure to specify `direction='minimize'`.
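The replicate loop in steps 1 and 2 can be sketched as repeated random train/validation splits whose validation losses are averaged. A stand-in logistic-regression model on toy data replaces the Keras model, and `n_splits=3` replaces the $n = 70,000$ replicates (which is exactly what made this plan infeasible):

```python
# Monte Carlo (repeated random holdout) validation: average validation loss
# over replicates, each holding n_v = n - n_c samples out. In the original
# plan n_splits would equal n; here 3 replicates just show the structure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=500, random_state=0)
n = len(X)
n_c = int(n ** 0.75)                 # training-set size, per Shao (1993)

losses = []
splitter = ShuffleSplit(n_splits=3, train_size=n_c, random_state=0)
for train_idx, val_idx in splitter.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    losses.append(log_loss(y[val_idx], model.predict_proba(X[val_idx])))

avg_val_loss = float(np.mean(losses))  # what would be reported to Optuna
```

With a deep model trained per replicate, multiplying this loop body by $n$ is what turned days into months.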
What I ended up using instead was something that took only a few minutes (maximum) per Optuna trial, instead of 70,000 times longer:
1. Load the MNIST training and test sets separately and hold $n_v = n - n_c$ randomly chosen samples out of the training set only, for validation during each training epoch.
2. Train a single model using $n_c$ samples (for MNIST, this was 3,833) until validation loss (computed on the $n_v = n - n_c$ held-out samples) ceases to decrease for a few epochs (as in Keras' `EarlyStopping` callback with its `patience` parameter). Evaluate the model with the lowest validation loss by running inference on the test data set (involved in neither training nor validation) and, if possible, calculating the probabilities of any relevant consequences as a measure of risk.
3. Combine steps 1 and 2 into an objective function for Optuna to minimize or maximize (depending on the risk/evaluation metric). Then perform the study, being sure to specify the direction of optimization (`direction='minimize'` or `direction='maximize'` as appropriate).
NOTE: In this case $n_c = n^{3/4}$, based on the article cited below; normally people use less than half the data for validation, but with this rule no more than half the data is ever used for training, and usually far less. The same article also required repeating the training and validation steps at least $n$ times. Even though I am not doing cross-validation, I am still performing validation during training, so despite the lack of a theoretical basis for doing so, I am still using this exponent to size the training set. Common practice is to hold out 1/5 of the data (or 1/K for K folds) for validation. In this case, I am holding out a proportion $1 - n_c/n = 1 - n^{-1/4}$ of the data, which always ends up being more than half of the entire data set.
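The training-set sizes quoted above follow directly from this rule and can be checked with a few lines of stdlib Python:

```python
# Training-set size n_c = floor(n^(3/4)) per Shao (1993), and the resulting
# held-out proportion 1 - n_c/n, for the two MNIST configurations above.
import math

def shao_train_size(n):
    """Training-set size n_c = floor(n^(3/4))."""
    return math.floor(n ** 0.75)

for n in (70_000, 60_000):      # combined MNIST; MNIST training set only
    n_c = shao_train_size(n)
    held_out = 1 - n_c / n
    print(n, n_c, f"{held_out:.1%}")
```

This reproduces $n_c = 4,303$ for the combined $n = 70,000$ and $n_c = 3,833$ for the $n = 60,000$ training set, with roughly 94% of the data held out in both cases.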
Shao, Jun. "Linear Model Selection by Cross-Validation." Journal of the American Statistical Association, Vol. 88, No. 422 (June 1993): 486.

Abstract: The problem of selecting a model having the best predictive value among a class of linear models is considered. Motivations, justifications, and discussions of some practical aspects of the use of the leave-$n_v$-out cross-validation method are provided, and results from a simulation study are presented.
Best Answer
The average score of your agent on the validation sets in cross validation should not be significantly better than the score on the holdout (final test) set, because for each cross-validation fold the agent is as blind to the data in the validation set as your final model (presumably retrained on all the training data) is to the holdout data.
Is there some quality of your holdout set that is causing your results to be skewed? Are the examples in the training set easier somehow? If you select a different random holdout set, does the problem persist?
So long as the variance in performance across the folds is low and the holdout set is statistically similar to the training set, performance should not degrade. Best practice is to select the parameters that yield the most reliable (low variance across folds) and best-performing (high mean) cross-validation score. I am not aware of any alternative criteria, nor of arguments for why one might prefer them.
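The sanity check suggested above can be sketched as comparing the mean cross-validation score against the score on a held-out set; on statistically similar splits the two should be close. Toy data and a logistic-regression model stand in for the asker's setup:

```python
# Compare mean CV score (with its across-fold variance) to the holdout score
# of a model retrained on all the training data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

final = model.fit(X_train, y_train)        # retrain on all training data
holdout_score = final.score(X_hold, y_hold)

print(f"CV mean: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
print(f"Holdout: {holdout_score:.3f}")
```

A large gap between the two numbers, or a high standard deviation across folds, is the signal that something about the splits (or the holdout set) deserves a closer look.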