Solved – How to build the final model and tune probability threshold after nested cross-validation

cross-validation, glmnet, hyperparameter, machine learning, model selection

Firstly, apologies for posting a question that has already been discussed at length here, here, here, here, here, and for reheating an old topic. I know @DikranMarsupial has written about this topic at length in posts and journal papers, but I'm still confused, and judging from the number of similar posts here, it's still something that others struggle to grasp. I should also state that I've received contradictory advice on this topic, which has added to my confusion. You should also know that I'm originally a physicist and not a statistician, so my domain expertise here is somewhat limited. I'm writing a journal paper in which I want to use nested CV to estimate the performance I can expect from my final model. In my domain, this is a first. (We almost never use any form of robust CV in my field but merrily pump out papers with results from studies using neural nets and boosted decision trees!) Therefore, it's very important that I have a very thorough and clear understanding so that I don't screw up and propagate an erroneous procedure to my community that could take years to unlearn! Thanks! On with the question…

How do I build the final model after nested cross-validation?

I'm training a simple glmnet model with L1 and L2 regularisation. It's fast, simple, and interpretable. I perform feature centring, scaling, and Box-Cox transformations so that the feature distributions are mean-centred, standardised, and somewhat Gaussian-like. I perform this step within cross-validation to prevent information leakage. Purely because my hardware is incredibly slow and I don't have access to more CPU muscle, I also perform fast filter-based feature selection within CV, after feature preprocessing. I'm using random grid search to pick the alpha and lambda hyperparameters. I understand that I shouldn't report the performance from the CV used to pick the optimal hyperparameters as an estimate of the expected performance of my final model (which would be overly optimistic) but should instead include an outer CV loop to get this estimate. I understand that the inner CV loop is used for model selection (in this case, the optimal hyperparameters) and that the outer loop is used for model evaluation, i.e., the inner and outer CV serve two different purposes that often are erroneously conflated. (How am I doing so far?)
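In case it helps, here is roughly what my setup looks like, sketched with scikit-learn's elastic-net logistic regression as a stand-in for glmnet (l1_ratio playing the role of alpha, C being roughly 1/lambda), with synthetic placeholder data and arbitrary parameter ranges; the point is only that all preprocessing and the feature filter sit inside the CV folds, and that the random search (inner CV) sits inside an outer CV:

import numpy as np
from scipy.stats import loguniform, uniform
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

X, y = make_classification(n_samples=300, n_features=30, random_state=0)  # placeholder data

# All data-dependent steps live inside the pipeline, so every CV training fold
# re-estimates the transform and the feature filter -> no information leakage.
pipe = Pipeline([
    # Yeo-Johnson rather than Box-Cox here only because these synthetic features
    # are not strictly positive; it also centres and scales (standardize=True).
    ("power", PowerTransformer(method="yeo-johnson", standardize=True)),
    ("filter", SelectKBest(mutual_info_classif, k=10)),
    ("clf", LogisticRegression(penalty="elasticnet", solver="saga", max_iter=10000)),
])

param_dist = {
    "clf__l1_ratio": uniform(0, 1),    # glmnet's alpha
    "clf__C": loguniform(1e-3, 1e2),   # roughly 1/lambda
}

# Inner loop: model selection (hyperparameter tuning).
inner_search = RandomizedSearchCV(pipe, param_dist, n_iter=100, cv=5,
                                  scoring="roc_auc", random_state=0)

# Outer loop: evaluation of the whole procedure, tuning included.
outer_scores = cross_val_score(inner_search, X, y, cv=5, scoring="roc_auc")
print(outer_scores.mean(), outer_scores.std())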

Now, the links I've posted suggest that "the way to think of cross-validation is as estimating the performance obtained using a method for building a model, rather than for estimating the performance of a model". Given that, how should I interpret the results of the nested CV procedure?

The advice I've read seems to indicate the following (please correct me if this is wrong): the inner CV is part of the mechanism that allows me to select the optimal alpha and lambda hyperparameters of my glmnet model. The outer CV gives me the estimate I can expect from the final model if I apply the procedure exactly as used in the inner CV, including hyperparameter tuning, and use the entire dataset to build the final model. That is, the hyperparameter tuning is part of "the method for building the model". Is this correct or not? Because this is what confuses me. Elsewhere I've seen that the procedure for building the final model to be deployed involves training on the entire dataset using the fixed values of the hyperparameters that were chosen using CV; here, "the method for building the model" does not include tuning. So, which is it? At some point the optimal hyperparameters are chosen and fixed for building the final model! Where? How? If my inner loop is 5-fold CV, my outer loop is 5-fold CV, and I select, say, 100 points for testing as part of the random grid search in the inner CV, how many times do I actually train the glmnet model? (100 * 5 * 5) + 1 for the final build, or are there more steps that I'm unaware of?

Basically, I need a very clear description of how to interpret the performance estimate from nested CV and how to build the final model.

I would also like to know the appropriate procedure for selecting the probability threshold for converting the probability scores from my final glmnet model into (binary) class labels. Is another loop of CV needed?

Best Answer

Nested cross validation explained without nesting

Here's how I see (nested) cross validation and model building. Note that I'm a chemist and, like you, look at the model building process from the application side (see below). My main point is that, from my point of view, I don't need a dedicated nested variety of cross validation. All I need is a validation method (e.g. cross validation) and a model training function:

model = f (training data)

"my" model training function f does not need any hyperparameters because it internally does all hyperparameter tuning (e.g. your alpha, lambda and threshold).

In other words, my training function may contain any number of inner cross validations (or out-of-bag or whatever performance estimate I may deem useful). However, note that the distinction between parameters and hyperparameters is typically that the hyperparameters need to be tuned to the data set/application at hand, whereas the parameters can then be fitted regardless of what data it is. Thus, from the point of view of the developer of a new classification algorithm, it does make sense to provide only the "naked" fitting function g (training data, hyperparameters) that fits the parameters when given data and hyperparameters.

The point of having the "outer" training function f is that, after you have done your cross validation run, it gives you a straightforward way to train "on the whole data set": just use f (whole data set) instead of the calls f (cv split training data) used for the cross validation surrogate models.

Thus, in your example, you'll have 5 + 1 calls to f, and each call to f will in turn make e.g. 100 * 5 calls to g (plus, strictly speaking, one final refit of g with the chosen hyperparameters).
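To make this concrete, here is a rough sketch (Python, with hypothetical helper names f and g; the synthetic data, the candidate list and the AUC scoring are just placeholders) of a training function f that internally tunes the hyperparameters by repeatedly calling the naked fitting function g, and of an outer CV that only ever calls f:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)  # placeholder data

def g(X, y, l1_ratio, C):
    """'Naked' fit: hyperparameters are inputs, only the parameters are fitted."""
    return LogisticRegression(penalty="elasticnet", solver="saga",
                              l1_ratio=l1_ratio, C=C, max_iter=10000).fit(X, y)

def f(X, y, candidates, inner_folds=5):
    """Complete training method: inner CV picks the hyperparameters, then refits g."""
    cv = StratifiedKFold(n_splits=inner_folds, shuffle=True, random_state=0)
    best, best_score = None, -np.inf
    for l1_ratio, C in candidates:                    # e.g. 100 random candidates
        fold_scores = []
        for tr, va in cv.split(X, y):                 # 5 inner folds
            m = g(X[tr], y[tr], l1_ratio, C)          # -> 100 * 5 calls to g
            fold_scores.append(roc_auc_score(y[va], m.predict_proba(X[va])[:, 1]))
        if np.mean(fold_scores) > best_score:
            best, best_score = (l1_ratio, C), np.mean(fold_scores)
    return g(X, y, *best)                             # + 1 refit with the chosen values

rng = np.random.default_rng(0)
candidates = list(zip(rng.uniform(0, 1, 100), 10 ** rng.uniform(-3, 2, 100)))

# Outer 5-fold CV: 5 calls to f, evaluating "the method" including its tuning.
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_scores = [roc_auc_score(y[te], f(X[tr], y[tr], candidates).predict_proba(X[te])[:, 1])
                for tr, te in outer.split(X, y)]

# The +1: the model you actually deploy, trained by the same method on all the data.
final_model = f(X, y, candidates)

In practice f would of course also contain your preprocessing and feature filter; they are left out here only to keep the sketch short.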


Probability threshold

While you could do this with yet another cross validation, this is not necessary: it is just one more hyperparameter your ready-to-use model has, and it can be estimated inside f.

What you need in order to fix it is a heuristic that allows you to calculate such a threshold. There is a wide variety of heuristics that are suitable in different situations: working from the ROC curve together with a statement of how important it is to avoid false positives compared to false negatives, requiring a minimum acceptable sensitivity or specificity or PPV or NPV, or allowing two thresholds and thus an "uncertain" (NA) level in between, and so on. Good heuristics are usually very application specific.

But for the question here, you can do this inside f, e.g. by using the predictions obtained during the inner cross validation to calculate the ROC curve and then finding your working point/threshold accordingly.
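As one concrete (and deliberately simplistic) example, you could pool the out-of-fold predicted probabilities from the inner cross validation and pick the threshold that maximises Youden's J on the ROC curve. In the sketch below (Python, with a plain logistic regression standing in for your tuned glmnet pipeline and synthetic placeholder data) you would substitute whatever application-specific heuristic you actually need:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # placeholder data
clf = LogisticRegression(max_iter=5000)   # stands in for the tuned glmnet pipeline

# Out-of-fold probabilities: every sample is predicted by a model that did not see it.
oof_proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

# Illustrative heuristic only: maximise Youden's J (tpr - fpr) on the ROC curve.
fpr, tpr, thresholds = roc_curve(y, oof_proba)
threshold = thresholds[np.argmax(tpr - fpr)]

# The ready-to-use model = classifier refitted on all the data + this fixed threshold.
final_clf = clf.fit(X, y)
predicted_labels = (final_clf.predict_proba(X)[:, 1] >= threshold).astype(int)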


Specific comments on parts of the question

I understand that I shouldn't report the performance from the CV used to pick the optimal hyperparameters as an estimate of the expected performance of my final model (which would be overly optimistic) but should instead include an outer CV loop to get this estimate.

Yes. (Though the inner estimate does carry information in relation to the outer estimate: if it is much more optimistic than the outer estimate, you are typically overfitting.)
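A small sketch of that diagnostic (again Python with placeholder data and a plain logistic regression rather than your pipeline): per outer fold, compare the inner-CV score of the selected hyperparameters with the score on the held-out outer fold; a consistently large gap is a warning sign.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, random_state=0)  # placeholder data
param_grid = {"C": np.logspace(-3, 2, 10)}

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for k, (tr, te) in enumerate(outer.split(X, y)):
    search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid,
                          scoring="roc_auc", cv=5)
    search.fit(X[tr], y[tr])
    inner_estimate = search.best_score_             # used for model selection
    outer_estimate = search.score(X[te], y[te])     # used for model evaluation
    print(f"fold {k}: inner {inner_estimate:.3f} vs outer {outer_estimate:.3f}")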

I understand that the inner CV loop is used for model selection

Any kind of data-driven model tuning, really -> that includes tuning your cutoff threshold.

(in this case, the optimal hyperparameters) and that the outer loop is used for model evaluation, i.e., the inner and outer CV serve two different purposes that often are erroneously conflated.

Yes.

That is, the hyperparameter tuning is part of "the method for building the model".

I prefer to see it this way as well: I'm a chemist and, like you, look at this from the application side: for me, a trained/fitted model is not complete without the hyperparameters; or, more precisely, a model is something I can use directly to obtain predictions. Though, as you note, other people have a different view (one in which "the method" does not include hyperparameter tuning). In my experience, this is often the case with people developing new models: hyperparameter tuning is then treated as a "solved problem" and not considered further. (Side note: their view of what cross validation can do in terms of validation is also slightly different from what cross validation can do from the application side.)