Solved – Bayesian hyperparameter optimization + cross-validation

Tags: bayesian optimization, cross-validation, hyperparameter, machine learning, neural networks

I want to use Bayesian optimization to search a space of hyperparameters for a neural network model. My objective function for this optimization is validation set accuracy.

In addition, I want to perform cross-validation such that I can get a good estimate of the best hyperparameters for test-set performance when training on the whole training set.

Given these two desires, and a search space for the Bayesian optimization procedure, I can see two options for how to conduct the experiment at a high level.

In the first, I split the training set into N folds. Then, for each fold, I run the entire Bayesian optimization process; this produces N sets of values for my hyperparameters, a best set for each fold. I choose the best set among those from the N folds and retrain on the whole training set. This is cross-validation in the classical setting.

In the second, within each evaluation of the objective function for the Bayesian optimization, I perform cross-validation to find the best validation set accuracy. That is, I train the model with the fixed hyperparameters corresponding to the point in the search space being evaluated, once for each training fold, and evaluate on each respective validation fold. The objective function value returned for that evaluation in the Bayesian optimization procedure is then the best validation set accuracy.
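To make the second option concrete, here is a minimal sketch of the structure I have in mind, using scikit-learn's MLPClassifier and the digits data set purely as stand-ins for my actual network and data:

```python
import optuna
from sklearn.datasets import load_digits
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)   # stand-in data set

def objective(trial):
    # The fixed hyperparameters for this point in the search space
    hidden = trial.suggest_int("hidden", 16, 256)
    alpha = trial.suggest_float("alpha", 1e-6, 1e-2, log=True)
    scores = []
    for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        clf = MLPClassifier(hidden_layer_sizes=(hidden,), alpha=alpha, max_iter=300)
        clf.fit(X[tr], y[tr])
        scores.append(clf.score(X[va], y[va]))   # validation accuracy on this fold
    # Returning the best fold accuracy, as described above; the mean over folds
    # is the more common choice.
    return max(scores)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
```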

My question is this: are these two approaches equivalent? Is the latter a statistically valid estimate of the best parameters, or something else entirely? Are there any (dis)advantages either way? The latter is considerably easier to implement given the Bayesian optimization framework I'm using (Optuna).

Best Answer

My demonstration code is here. Check it out, then come back here and read/leave comments. If you want to edit my code to make it run on a wider variety of machines that would be wonderful; just submit a pull request and/or fork the code yourself and let me know as a courtesy.

Cross-validation is an approximation of Bayesian optimization, so it is not strictly necessary to combine it with Optuna. However, if it makes sense for your problem and you have the time, doing both simply amounts to meta-optimization.

I came to this conclusion yesterday while setting up a parameter study, when it became clear the study would take months, not just days, to complete. Here is how I had it set up initially:

  1. Combine the MNIST training and test sets into a single data set, then train and validate $n$ times (for the combined MNIST data, $n = 70,000$), holding out $n_v = 65,697$ randomly chosen samples for validation during each training epoch in each cross-validation replicate.

  2. Train each replicate's model using the $n_c = 4,303$ samples not held out for validation (with the loss at the end of each epoch computed on the $n_v$ held-out samples) until the validation loss ceases to decrease for a specified number of epochs (via Keras' EarlyStopping callback and its patience parameter). Report the model's performance to Optuna as the average, over all replicates, of each replicate's best validation loss. See below for how to choose $n_c = n - n_v$ (the training set size).

  3. Combine steps 1 and 2 into an objective function for Optuna to minimize the average validation loss, then perform the study, being sure to specify `direction='minimize'` (a rough sketch of this setup follows the list).
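Here is a rough sketch of that abandoned setup, assuming a small placeholder architecture and search space and a drastically reduced replicate count (the full scheme called for roughly $n$ replicates, which is exactly why it was infeasible):

```python
import numpy as np
import optuna
from tensorflow import keras

# Combined MNIST data set, n = 70,000
(x_tr, y_tr), (x_te, y_te) = keras.datasets.mnist.load_data()
x_all = np.concatenate([x_tr, x_te]) / 255.0
y_all = np.concatenate([y_tr, y_te])
n = len(x_all)
n_c = int(n ** 0.75)          # 4,303 training samples per replicate
n_replicates = 10             # illustrative only; the full scheme used ~n replicates

def objective(trial):
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    losses = []
    for _ in range(n_replicates):
        idx = np.random.permutation(n)
        tr, va = idx[:n_c], idx[n_c:]          # hold out n_v = n - n_c for validation
        model = keras.Sequential([
            keras.layers.Flatten(input_shape=(28, 28)),
            keras.layers.Dense(128, activation="relu"),
            keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer=keras.optimizers.Adam(lr),
                      loss="sparse_categorical_crossentropy", metrics=["accuracy"])
        hist = model.fit(x_all[tr], y_all[tr],
                         validation_data=(x_all[va], y_all[va]),
                         epochs=100, verbose=0,
                         callbacks=[keras.callbacks.EarlyStopping(
                             monitor="val_loss", patience=5, restore_best_weights=True)])
        losses.append(min(hist.history["val_loss"]))   # best validation loss, this replicate
    return float(np.mean(losses))                      # average over replicates

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```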

What I ended up using instead was something that took only a few minutes (maximum) per Optuna trial, instead of 70,000 times longer:

  1. Load the MNIST training and test sets separately and hold out $n_v = n - n_c$ randomly chosen samples from the training set only, for validation during each training epoch.

  2. Train a single model using $n_c$ samples (for MNIST, $n_c = 3,833$) until the validation loss (computed on the $n_v = n - n_c$ held-out samples) ceases to decrease for a few epochs (as with Keras' EarlyStopping callback and its patience parameter). Evaluate the model with the lowest validation loss by running inference on the test data set (which is involved in neither training nor validation) and, where possible, calculating the probabilities of any relevant consequences as a measure of risk.

  3. Combine steps 1 and 2 into an objective function for Optuna to minimize or maximize (depending on the risk/evaluation metric), then perform the study, being sure to specify the direction of optimization (`direction='maximize'` or `direction='minimize'` as appropriate). A sketch of this setup follows the list.
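A minimal sketch of this faster setup, again with a placeholder architecture and search space, and plain test accuracy standing in for the risk/evaluation metric described above:

```python
import numpy as np
import optuna
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
n = len(x_train)                          # 60,000
n_c = int(n ** 0.75)                      # 3,833 training samples
idx = np.random.permutation(n)
tr, va = idx[:n_c], idx[n_c:]             # n_v = n - n_c held out for validation

def objective(trial):
    units = trial.suggest_int("units", 32, 256)
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(units, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(lr),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train[tr], y_train[tr],
              validation_data=(x_train[va], y_train[va]),
              epochs=100, verbose=0,
              callbacks=[keras.callbacks.EarlyStopping(
                  monitor="val_loss", patience=5, restore_best_weights=True)])
    # restore_best_weights leaves the model with the lowest validation loss;
    # evaluate it on the untouched test set.
    _, test_acc = model.evaluate(x_test, y_test, verbose=0)
    return test_acc

study = optuna.create_study(direction="maximize")   # maximizing test accuracy
study.optimize(objective, n_trials=50)
```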

NOTE: In this case $n_c = n^{3/4}$ (rounded down), based on the article cited below. Normally people hold out less than half of the data for validation, but under this rule no more than half of the data, and usually far less, is used for training. The same article also requires repeating the training and validation steps at least $n$ times. Even though I am not doing cross-validation, I am still performing validation during training, so despite the lack of a theoretical basis for doing so, I still use this power-law rule for the training set size. Common practice is to hold out 1/5 of the data (or 1/K for K folds) for validation; here I am holding out a proportion $1 - n^{-1/4} = (n - n_c)/n$ of the data, which always works out to more than half of the entire data set.
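A quick check of the sizes quoted above, computing $n_c = \lfloor n^{3/4} \rfloor$ for the combined data set and for the training set alone:

```python
# n_c = floor(n^(3/4)) and the corresponding held-out validation size n_v = n - n_c
for n in (70_000, 60_000):        # combined MNIST, and the MNIST training set alone
    n_c = int(n ** 0.75)          # 4,303 and 3,833 respectively
    print(n, n_c, n - n_c)        # -> 70000 4303 65697 and 60000 3833 56167
```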

Reference: Shao, Jun (1993). "Linear model selection by cross-validation." Journal of the American Statistical Association, Vol. 88, No. 422, p. 486.

Abstract: The problem of selecting a model having the best predictive value among a class of linear models is considered. Motivations, justifications and discussions of some practical aspects of the use of the leave-$n_v$-out cross-validation method are provided, and results from a simulation study are presented.