Model Selection and Cross-Validation – The Right Strategies

Tags: cross-validation, model-selection

There are numerous threads on CrossValidated on the topic of model selection and cross-validation. However, the answers to those threads are fairly generic and mostly highlight the issues with particular approaches to cross-validation and model selection.

To make things as concrete as possible, say for example that we are working with an SVM with an RBF kernel, $K(x, x') = \exp(-\gamma \, \Vert x - x' \Vert^2)$, that I have a dataset of features X and labels y, and that I want to

  1. Find the best possible values of my model's hyper-parameters ($\gamma$ and $C$)
  2. Train the SVM with my dataset (for final deployment)
  3. Estimate the generalization error and the uncertainty (variance) around this error

To do so, I would personally do a grid search, i.e. I try every possible combination of $C$ and $\gamma$. For simplicity, we can assume the following ranges:

  • $C \in \{10, 100, 1000\}$
  • $\gamma \in \{0.1, 0.2, 0.5, 1.0\}$

More specifically, using my full dataset I do the following (sketched in code after the list):

  1. For every ($C$, $\gamma$) pair, I run repeated iterations (e.g. 100 random repetitions) of $K$-fold cross-validation (e.g. $K = 10$) on my dataset, i.e. I train my SVM on $K-1$ folds and evaluate the error on the left-out fold, iterating through all $K$ folds. Overall, I collect 100 x 10 = 1000 test errors.
  2. For each such ($C$, $\gamma$) pair, I compute the mean and the variance of those 1000 test errors, $\mu_M$ and $\sigma_M$.
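
For reference, here is roughly what this procedure looks like in code. This is a minimal sketch, assuming scikit-learn and NumPy arrays X and y holding the full dataset (the variable names are illustrative):

```python
# Minimal sketch of the grid search + repeated K-fold procedure above.
from itertools import product

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

Cs = [10, 100, 1000]
gammas = [0.1, 0.2, 0.5, 1.0]
n_repeats, n_folds = 100, 10

results = {}
for C, gamma in product(Cs, gammas):
    errors = []
    for rep in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=rep)
        for train_idx, test_idx in cv.split(X, y):
            model = SVC(kernel="rbf", C=C, gamma=gamma)
            model.fit(X[train_idx], y[train_idx])
            # test error on the left-out fold
            errors.append(1.0 - model.score(X[test_idx], y[test_idx]))
    # mean and variance of the 100 x 10 = 1000 test errors for this pair
    results[(C, gamma)] = (np.mean(errors), np.var(errors))
```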

Now I want to choose the best model (the best kernel parameters) that I would use to train my final SVM on the full dataset. My understanding is that choosing the model with the lowest error mean $\mu_M$ and variance $\sigma_M$ would be the right choice, and that this model's $\mu_M$ and $\sigma_M$ are my best estimates of the model's generalization error and of its variance when training with the full dataset.

BUT, after reading the answers in the threads above, I get the impression that this method for choosing the best SVM for deployment and/or for estimating its error (generalization performance) is flawed, and that there are better ways of choosing the best SVM and reporting its error. If so, what are they? I am looking for a concrete answer, please.

Sticking to this problem, how specifically can I choose the best model and properly estimate its generalization error?

Best Answer

My paper in JMLR addresses this exact question, and demonstrates why the procedure suggested in the question (or at least one very like it) results in optimistically biased performance estimates:

Gavin C. Cawley, Nicola L. C. Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", Journal of Machine Learning Research, 11(Jul):2079−2107, 2010. (www)

The key thing to remember is that cross-validation is a technique for estimating the generalisation performance for a method of generating a model, rather than of the model itself. So if choosing kernel parameters is part of the process of generating the model, you need to cross-validate the model selection process as well, otherwise you will end up with an optimistically biased performance estimate (as will happen with the procedure you propose).

Assume you have a function fit_model, which takes in a dataset consisting of attributes X and desired responses Y, and which returns the fitted model for that dataset, including the tuning of hyper-parameters (in this case kernel and regularisation parameters). This tuning of hyper-parameters can be performed in many ways, for example minimising the cross-validation error over X and Y.
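
As a concrete illustration (not something prescribed here), fit_model could be sketched with scikit-learn's GridSearchCV, tuning $C$ and $\gamma$ by minimising the cross-validation error over X and Y; the grid is the one from the question, and everything else is an illustrative assumption:

```python
# One possible fit_model: hyper-parameter tuning by grid-search CV,
# followed by refitting on all of the supplied data.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

PARAM_GRID = {"C": [10, 100, 1000], "gamma": [0.1, 0.2, 0.5, 1.0]}

def fit_model(X, y):
    """Fit an RBF SVM to (X, y), including tuning of C and gamma."""
    search = GridSearchCV(SVC(kernel="rbf"), PARAM_GRID,
                          cv=10, scoring="accuracy", refit=True)
    search.fit(X, y)
    # best_estimator_ has been refitted on all of (X, y) with the selected
    # hyper-parameters; best_params_ records the chosen (C, gamma).
    return search.best_estimator_, search.best_params_
```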

Step 1 - Fit the model to all available data, using the function fit_model. This gives you the model that you will use in operation or deployment.
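
Continuing the sketch above, step 1 is then a single call:

```python
# Step 1: the deployment model is fit_model applied to all available data.
final_model, final_params = fit_model(X, y)
```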

Step 2 - Performance evaluation. Perform repeated cross-validation using all available data. In each fold, the data are partitioned into a training set and a test set. Fit the model using the training set (record hyper-parameter values for the fitted model) and evaluate performance on the test set. Use the mean over all of the test sets as a performance estimate (and perhaps look at the spread of values as well).
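
Under the same assumptions as the sketch above, step 2 might look like this: the entire fit_model procedure, hyper-parameter tuning included, is rerun inside every fold of a repeated outer cross-validation, and only the outer test folds contribute to the performance estimate:

```python
# Step 2: repeated cross-validation of the whole model-fitting procedure.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

outer_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
test_errors, chosen_params = [], []
for train_idx, test_idx in outer_cv.split(X, y):
    # the full procedure, including hyper-parameter tuning, in each fold
    model, params = fit_model(X[train_idx], y[train_idx])
    test_errors.append(1.0 - model.score(X[test_idx], y[test_idx]))
    chosen_params.append((params["C"], params["gamma"]))  # kept for step 3

print("estimated generalisation error:", np.mean(test_errors))
print("spread across test folds:", np.std(test_errors))
```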

Step 3 - Variability of hyper-parameter settings - perform an analysis of the hyper-parameter values collected in step 2. However, I should point out that there is nothing special about hyper-parameters; they are just parameters of the model that have been estimated (indirectly) from the data. They are treated as hyper-parameters rather than parameters for computational/mathematical convenience, but this doesn't have to be the case.
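
For step 3, a simple starting point is to tabulate the hyper-parameter values recorded in the sketch of step 2 above (again, only an illustration):

```python
# Step 3: how much do the selected hyper-parameters vary across folds?
from collections import Counter

for (C, gamma), count in Counter(chosen_params).most_common():
    print(f"C={C}, gamma={gamma}: selected in {count} folds")
```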

The problem with using cross-validation here is that the training and test data are not independent samples (as they share data) which means that the estimate of the variance of the performance estimate and of the hyper-parameters is likely to be biased (i.e. smaller than it would be for genuinely independent samples of data in each fold). Rather than repeated cross-validation, I would probably use bootstrapping instead and bag the resulting models if this was computationally feasible.
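
A rough sketch of that bootstrap alternative, under the same assumptions as before: fit_model is rerun on each bootstrap resample, performance is estimated on the out-of-bag points, and the resulting models are bagged by majority vote (the number of resamples and the voting scheme are illustrative choices; X_new is a hypothetical batch of unseen inputs and the labels are assumed to be integer-coded):

```python
# Bootstrap estimate of performance, with bagging of the fitted models.
import numpy as np
from sklearn.utils import resample

n_boot = 50
n = len(y)
oob_errors, models = [], []

for b in range(n_boot):
    idx = resample(np.arange(n), replace=True, n_samples=n, random_state=b)
    oob = np.setdiff1d(np.arange(n), idx)      # out-of-bag points
    model, _ = fit_model(X[idx], y[idx])       # whole procedure on the resample
    oob_errors.append(1.0 - model.score(X[oob], y[oob]))
    models.append(model)

print("out-of-bag error estimate:", np.mean(oob_errors))
print("variability across resamples:", np.std(oob_errors))

# Bagging: majority vote of the bootstrap models on new inputs X_new.
votes = np.stack([m.predict(X_new) for m in models])   # shape (n_boot, n_new)
bagged = np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```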

The key point is that to get an unbiased performance estimate, whatever procedure you use to generate the final model (fit_model) must be repeated in its entirety independently in each fold of the cross-validation procedure.
