How do I choose a model from this [outer cross validation] output?
Short answer: You don't.
Treat the inner cross validation as part of the model fitting procedure. That means that the fitting, including the fitting of the hyper-parameters (this is where the inner cross validation hides), is just like any other model estimation routine.
The outer cross validation estimates the performance of this model fitting approach. For that you use the usual assumptions:
- The $k$ outer surrogate models are equivalent to the "real" model built by `model.fitting.procedure` with all data.
- Or, in case the first assumption breaks down (pessimistic bias of resampling validation), at least the $k$ outer surrogate models are equivalent to each other.
This allows you to pool (average) the test results. It also means that you do not need to choose among them as you assume that they are basically the same.
The breaking down of this second, weaker assumption is model instability.
Do not pick the seemingly best of the $k$ surrogate models - that would usually just be "harvesting" testing uncertainty and would lead to an optimistic bias.
So how can I use nested CV for model selection?
The inner CV does the selection.
It looks to me that selecting the best model out of those K winning models would not be a fair comparison since each model was trained and tested on different parts of the dataset.
You are right in that it is not a good idea to pick one of the $k$ surrogate models. But you are wrong about the reason - the real reason is the one given above. The fact that they are not trained and tested on the same data does not "hurt" here.
- Not having the same testing data: as you want to claim afterwards that the test results generalize to never seen data, this cannot make a difference.
- Not having the same training data:
- if the models are stable, this doesn't make a difference: Stable here means that the model does not change (much) if the training data is "perturbed" by replacing a few cases by other cases.
- if the models are not stable, three considerations are important:
- you can actually measure whether and to what extent this is the case, by using iterated/repeated $k$-fold cross validation. That allows you to compare cross validation results for the same case that were predicted by different models built on slightly differing training data (see the sketch after this list).
- If the models are not stable, the variance observed over the test results of the $k$-fold cross validation increases: you do not only have the variance due to the fact that only a finite number of cases is tested in total, but have additional variance due to the instability of the models (variance in the predictive abilities).
- If instability is a real problem, you cannot extrapolate well to the performance for the "real" model.
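For illustration, a minimal sketch of such a stability check (my own example, not part of the original procedure, assuming scikit-learn; classifier and data are placeholders):

```python
# In repeated k-fold CV each case is predicted by several surrogate models
# trained on slightly different data, so the per-case spread of those
# predictions measures model (in)stability.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

n_splits, n_repeats = 5, 10
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)

probs = np.full((len(y), n_repeats), np.nan)      # one column per repetition
for i, (train, test) in enumerate(rkf.split(X)):
    rep = i // n_splits                           # repetition this fold belongs to
    fitted = model.fit(X[train], y[train])
    probs[test, rep] = fitted.predict_proba(X[test])[:, 1]

per_case_sd = probs.std(axis=1)                   # spread over surrogate models
print("median prediction SD over surrogates:", np.median(per_case_sd))
```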
Which brings me to your last question:
What types of analysis /checks can I do with the scores that I get from the outer K folds?
- check for stability of the predictions (use iterated/repeated cross-validation)
- check for the stability/variation of the optimized hyper-parameters.
  For one thing, wildly scattering hyper-parameters may indicate that the inner optimization didn't work. For another thing, this may allow you to decide on the hyperparameters without the costly optimization step in similar situations in the future. By costly I do not refer to computational resources but to the fact that this "costs" information that may better be used for estimating the "normal" model parameters.
- check for the difference between the inner and outer estimate of the chosen model. If there is a large difference (the inner being very overoptimistic), there is a risk that the inner optimization didn't work well because of overfitting (a sketch of these checks follows this list).
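A minimal sketch of the latter two checks (my own illustration, not code from the answer, assuming scikit-learn's `GridSearchCV` as the inner optimization; model and grid are placeholders):

```python
# For each outer fold: record the winning hyperparameters (check their scatter)
# and compare the inner selection score with the outer test score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

outer = KFold(n_splits=5, shuffle=True, random_state=0)
chosen_params, inner_scores, outer_scores = [], [], []

for train, test in outer.split(X):
    search = GridSearchCV(SVC(), param_grid, cv=5)        # inner CV = tuning
    search.fit(X[train], y[train])
    chosen_params.append(search.best_params_)             # hyperparameter stability
    inner_scores.append(search.best_score_)               # inner (optimistic) estimate
    outer_scores.append(search.score(X[test], y[test]))   # outer (validation) estimate

print("hyperparameters per outer fold:", chosen_params)
print("inner vs outer accuracy:", np.mean(inner_scores), "vs", np.mean(outer_scores))
```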
Update answering @user99889's question: What to do if outer CV finds instability?
First of all, detecting in the outer CV loop that the models do not yield stable predictions doesn't, in that respect, really differ from detecting that the prediction error is too high for the application. It is one of the possible outcomes of model validation (or verification) implying that the model we have is not fit for its purpose.
In the comment answering @davips, I was thinking of tackling the instability in the inner CV - i.e. as part of the model optimization process.
But you are certainly right: if we change our model based on the findings of the outer CV, yet another round of independent testing of the changed model is necessary.
However, instability in the outer CV would also be a sign that the optimization wasn't set up well - so finding instability in the outer CV implies that the inner CV did not penalize instability in the necessary fashion - this would be my main point of critique in such a situation. In other words, why does the optimization allow/lead to heavily overfit models?
However, there is one peculiarity here that IMHO may excuse the further change of the "final" model after careful consideration of the exact circumstances: As we did detect overfitting, any proposed change (fewer d.f./more restrictive or aggregation) to the model would be in direction of less overfitting (or at least hyperparameters that are less prone to overfitting). The point of independent testing is to detect overfitting - underfitting can be detected by data that was already used in the training process.
So if we are talking, say, about further reducing the number of latent variables in a PLS model that would be comparably benign (if the proposed change would be a totally different type of model, say PLS instead of SVM, all bets would be off), and I'd be even more relaxed about it if I'd know that we are anyways in an intermediate stage of modeling - after all, if the optimized models are still unstable, there's no question that more cases are needed. Also, in many situations, you'll eventually need to perform studies that are designed to properly test various aspects of performance (e.g. generalization to data acquired in the future).
Still, I'd insist that the full modeling process would need to be reported, and that the implications of these late changes would need to be carefully discussed.
Also, aggregation including an out-of-bag-analogue CV estimate of performance would be possible from the already available results - which is the other type of "post-processing" of the model that I'd be willing to consider benign here. Yet again, it would then have been better if the study had been designed from the beginning to check that aggregation provides no advantage over individual predictions (which is another way of saying that the individual models are stable).
Update (2019): the more I think about these situations, the more I come to favor the "nested cross validation apparently without nesting" approach.
Nested cross validation explained without nesting
Here's how I see (nested) cross validation and model building. Note that I'm a chemist and, like you, look at the model building process from the application side (see below). My main point here is that, from my point of view, I don't need a dedicated nested variety of cross validation. I need a validation method (e.g. cross validation) and a model training function:
model = f (training data)
"my" model training function f
does not need any hyperparameters because it internally does all hyperparameter tuning (e.g. your alpha
, lambda
and threshold
).
In other words, my training function may contain any number of inner cross validations (or out-of-bag or whatever performance estimate I may deem useful). However, note that the distinction between parameters and hyper-parameters typically is that the hyperparameters need to be tuned to the data set/application at hand, whereas the parameters can then be fitted regardless of what data it is. Thus from the point of view of the developer of a new classification algorithm, it does make sense to provide only the "naked" fitting function `g(training data, hyperparameters)` that fits the parameters if given data and hyperparameters.
The point of having the "outer" training function `f` is that after you did your cross validation run, it gives you a straightforward way to train "on the whole data set": just use `f(whole data set)` instead of the call `f(cv split training data)` for the cross validation surrogate models.
Thus in your example, you'll have 5+1 calls to `f`, and each of the calls to `f` will have e.g. 100 * 5 calls to `g`.
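To make this concrete, a minimal sketch of what such an `f` could look like (my illustration, assuming scikit-learn; the SVM and its grid are placeholders, and the role of `g` is played by fitting one model with fixed hyperparameters):

```python
# f takes only training data and returns a ready-to-use model; all
# hyperparameter tuning (the inner CV) happens inside it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

def f(X_train, y_train):
    """Train a ready-to-use model; all hyperparameter tuning happens inside."""
    grid = {"C": np.logspace(-2, 2, 10), "gamma": np.logspace(-3, 1, 10)}  # 100 candidates
    search = GridSearchCV(SVC(), grid, cv=5)   # ~100 * 5 calls to the "naked" fit g
    search.fit(X_train, y_train)
    return search.best_estimator_              # refit on all of X_train

X, y = make_classification(n_samples=300, random_state=0)

# 5 calls f(cv split training data): the outer surrogate models
outer = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = [f(X[tr], y[tr]).score(X[te], y[te]) for tr, te in outer.split(X)]
print("estimated performance of the modelling procedure:", np.mean(fold_scores))

# + 1 call f(whole data set): the model you actually use
final_model = f(X, y)
```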
Probability threshold
While you could do this with yet another cross validation, this is not necessary: it is just one more hyperparameter your ready-to-use model has and it can be estimated inside `f`.
What you need in order to fix the threshold is a heuristic that allows you to calculate it. There's a wide variety of heuristics that are suitable in different situations - from ROC plus a specification of how important it is to avoid false positives compared to false negatives, over a minimum acceptable sensitivity, specificity, PPV or NPV, to allowing two thresholds and thus an "uncertain" (NA) level, and so on. Good heuristics are usually very application specific.
But for the question here, you can do this inside `f`, e.g. by using the predictions obtained during the inner cross validation to calculate the ROC and then find your working point/threshold accordingly.
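A minimal sketch of one such heuristic (my illustration, assuming scikit-learn; Youden's J is only an example working point - your heuristic should reflect the application-specific costs of false positives vs. false negatives):

```python
# Inside f, out-of-fold predictions from an inner CV are used only to set the
# threshold; the final model is then fit on all training data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

def f(X_train, y_train):
    base = LogisticRegression(max_iter=1000)
    oof_probs = cross_val_predict(base, X_train, y_train, cv=5,
                                  method="predict_proba")[:, 1]
    fpr, tpr, thresholds = roc_curve(y_train, oof_probs)
    threshold = thresholds[np.argmax(tpr - fpr)]   # Youden's J working point
    model = base.fit(X_train, y_train)             # final fit on all training data
    return model, threshold

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
model, threshold = f(X, y)
y_hat = (model.predict_proba(X)[:, 1] >= threshold).astype(int)
```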
Specific Comments to parts of the question
I understand that I shouldn't report the performance from the CV used to pick the optimal hyperparameters as an estimate of the expected performance of my final model (which would be overly-optimistic) but should instead include an outer CV loop to get this estimate.
Yes. (Though the inner estimate does carry information in relation to the outer estimate: if it is much more optimistic than the outer estimate, you are typically overfitting.)
I understand that the inner CV loop is used for model selection
Any kind of data-driven model tuning, really -> that includes tuning your cutoff-threshold.
(in this case, the optimal hyperparameters) and that the outer loop is used for model evaluation, i.e., the inner and outer CV serve two different purposes that often are erroneously conflated.
Yes.
That is, the hyperparameter tuning is part of "the method for building the model".
I prefer to see it this way as well: I'm a chemist and, like you, look at it from the application side: for me a trained/fitted model is not complete without the hyperparameters, or more precisely, a model is something I can use directly to obtain predictions.
Though as you note other people have a different view (without hyperparameter tuning). In my experience, this is often the case with people developing new models: hyperparameter tuning is then a "solved problem" and not considered. (side note: their view on what cross validation can do in terms of validation is also slightly different from what cross validation can do from the application side).
Best Answer
It sounds like you're taking the correct approach - you'll want to do a nested CV so that you tune your parameters on the inner dataset, and then estimate the error on a holdout set that the model has never seen before. As an example:
Divide your training set into 10 folds. Use 9 of the folds to tune your model (again through CV), and then estimate the error on the 10th fold that you held out. Repeat this 10 times, once per fold, to get an estimate of the error (and you could do another set of 10 runs if you randomly generate a different set of folds), as sketched below.
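A minimal sketch of that recipe (assuming scikit-learn; model and grid are placeholders):

```python
# GridSearchCV does the tuning CV on the 9 folds, while the (repeated) outer
# 10-fold CV estimates the error on the held-out fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

tuned = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=10)         # inner: tuning
outer = RepeatedKFold(n_splits=10, n_repeats=2, random_state=0)  # outer: new folds each repeat
scores = cross_val_score(tuned, X, y, cv=outer)                  # error on held-out folds
print(scores.mean(), scores.std())
```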
The Elements of Statistical Learning explicitly warns against "[Using] cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model." - emphasis mine (see Ch 7, Section 10.2). You can of course use CV to estimate these separately.
If you need another citation for the importance of this, some researchers at Google just released a paper related to this. If you have access to the paper in Science, they even released their Python code along with the article.