How do I choose a model from this [outer cross validation] output?
Short answer: You don't.
Treat the inner cross validation as part of the model fitting procedure. That means that the fitting, including the fitting of the hyper-parameters (this is where the inner cross validation hides), is just like any other model estimation routine.
The outer cross validation estimates the performance of this model fitting approach. For that you use the usual assumptions:
- The $k$ outer surrogate models are equivalent to the "real" model built by `model.fitting.procedure` with all data.
- Or, in case the first assumption breaks down (pessimistic bias of resampling validation), at least the $k$ outer surrogate models are equivalent to each other.
This allows you to pool (average) the test results. It also means that you do not need to choose among them as you assume that they are basically the same.
The breaking down of this second, weaker assumption is model instability.
Do not pick the seemingly best of the $k$ surrogate models - that would usually just be "harvesting" testing uncertainty and would lead to an optimistic bias.
So how can I use nested CV for model selection?
The inner CV does the selection.
It looks to me that selecting the best model out of those K winning models would not be a fair comparison since each model was trained and tested on different parts of the dataset.
You are right in that it is not a good idea to pick one of the $k$ surrogate models. But you are wrong about the reason - the real reason is the one given above. The fact that they are not trained and tested on the same data does not "hurt" here.
- Not having the same testing data: as you want to claim afterwards that the test results generalize to never seen data, this cannot make a difference.
- Not having the same training data:
- if the models are stable, this doesn't make a difference: Stable here means that the model does not change (much) if the training data is "perturbed" by replacing a few cases by other cases.
- if the models are not stable, three considerations are important:
- you can actually measure whether and to what extent this is the case, by using iterated/repeated $k$-fold cross validation. That allows you to compare cross validation results for the same case that were predicted by different models built on slightly differing training data (see the sketch after this list).
- If the models are not stable, the variance observed over the test results of the $k$-fold cross validation increases: you do not only have the variance due to the fact that only a finite number of cases is tested in total, but have additional variance due to the instability of the models (variance in the predictive abilities).
- If instability is a real problem, you cannot extrapolate well to the performance for the "real" model.
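For illustration, a minimal sketch of such a stability check (my own example, not part of the original procedure, assuming scikit-learn; classifier and data are placeholders):

```python
# In repeated k-fold CV each case is predicted by several surrogate models
# trained on slightly different data, so the per-case spread of those
# predictions measures model (in)stability.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

n_splits, n_repeats = 5, 10
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)

probs = np.full((len(y), n_repeats), np.nan)      # one column per repetition
for i, (train, test) in enumerate(rkf.split(X)):
    rep = i // n_splits                           # repetition this fold belongs to
    fitted = model.fit(X[train], y[train])
    probs[test, rep] = fitted.predict_proba(X[test])[:, 1]

per_case_sd = probs.std(axis=1)                   # spread over surrogate models
print("median prediction SD over surrogates:", np.median(per_case_sd))
```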
Which brings me to your last question:
What types of analysis /checks can I do with the scores that I get from the outer K folds?
- check for stability of the predictions (use iterated/repeated cross-validation)
- check for the stability/variation of the optimized hyper-parameters.
  For one thing, wildly scattering hyper-parameters may indicate that the inner optimization didn't work. For another thing, this may allow you to decide on the hyperparameters without the costly optimization step in similar situations in the future. By costly I do not refer to computational resources but to the fact that this "costs" information that may better be used for estimating the "normal" model parameters.
- check for the difference between the inner and outer estimate of the chosen model. If there is a large difference (the inner being very overoptimistic), there is a risk that the inner optimization didn't work well because of overfitting (a sketch of these checks follows this list).
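A minimal sketch of the latter two checks (my own illustration, not code from the answer, assuming scikit-learn's `GridSearchCV` as the inner optimization; model and grid are placeholders):

```python
# For each outer fold: record the winning hyperparameters (check their scatter)
# and compare the inner selection score with the outer test score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

outer = KFold(n_splits=5, shuffle=True, random_state=0)
chosen_params, inner_scores, outer_scores = [], [], []

for train, test in outer.split(X):
    search = GridSearchCV(SVC(), param_grid, cv=5)        # inner CV = tuning
    search.fit(X[train], y[train])
    chosen_params.append(search.best_params_)             # hyperparameter stability
    inner_scores.append(search.best_score_)               # inner (optimistic) estimate
    outer_scores.append(search.score(X[test], y[test]))   # outer (validation) estimate

print("hyperparameters per outer fold:", chosen_params)
print("inner vs outer accuracy:", np.mean(inner_scores), "vs", np.mean(outer_scores))
```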
Update answering @user99889's question: What to do if outer CV finds instability?
First of all, detecting in the outer CV loop that the models do not yield stable predictions doesn't, in that respect, really differ from detecting that the prediction error is too high for the application. It is one of the possible outcomes of model validation (or verification) implying that the model we have is not fit for its purpose.
In the comment answering @davips, I was thinking of tackling the instability in the inner CV - i.e. as part of the model optimization process.
But you are certainly right: if we change our model based on the findings of the outer CV, yet another round of independent testing of the changed model is necessary.
However, instability in the outer CV would also be a sign that the optimization wasn't set up well - so finding instability in the outer CV implies that the inner CV did not penalize instability in the necessary fashion - this would be my main point of critique in such a situation. In other words, why does the optimization allow/lead to heavily overfit models?
However, there is one peculiarity here that IMHO may excuse the further change of the "final" model after careful consideration of the exact circumstances: As we did detect overfitting, any proposed change (fewer d.f./more restrictive or aggregation) to the model would be in direction of less overfitting (or at least hyperparameters that are less prone to overfitting). The point of independent testing is to detect overfitting - underfitting can be detected by data that was already used in the training process.
So if we are talking, say, about further reducing the number of latent variables in a PLS model that would be comparably benign (if the proposed change would be a totally different type of model, say PLS instead of SVM, all bets would be off), and I'd be even more relaxed about it if I'd know that we are anyways in an intermediate stage of modeling - after all, if the optimized models are still unstable, there's no question that more cases are needed. Also, in many situations, you'll eventually need to perform studies that are designed to properly test various aspects of performance (e.g. generalization to data acquired in the future).
Still, I'd insist that the full modeling process would need to be reported, and that the implications of these late changes would need to be carefully discussed.
Also, aggregation including an out-of-bag-analogue CV estimate of performance would be possible from the already available results - which is the other type of "post-processing" of the model that I'd be willing to consider benign here. Yet again, it would then have been better if the study had been designed from the beginning to check that aggregation provides no advantage over individual predictions (which is another way of saying that the individual models are stable).
Update (2019): the more I think about these situations, the more I come to favor the "nested cross validation apparently without nesting" approach.
Nested cross validation explained without nesting
Here's how I see (nested) cross validation and model building. Note that I'm a chemist and, like you, look at the model building process from the application side (see below). My main point here is that, from my point of view, I don't need a dedicated nested variety of cross validation. I need a validation method (e.g. cross validation) and a model training function:
model = f (training data)
"my" model training function f
does not need any hyperparameters because it internally does all hyperparameter tuning (e.g. your alpha
, lambda
and threshold
).
In other words, my training function may contain any number of inner cross validations (or out-of-bag or whatever performance estimate I may deem useful). However, note that the distinction between parameters and hyper-parameters typically is that the hyperparameters need to be tuned to the data set/application at hand, whereas the parameters can then be fitted regardless of what data it is. Thus from the point of view of the developer of a new classification algorithm, it does make sense to provide only the "naked" fitting function `g(training data, hyperparameters)` that fits the parameters if given data and hyperparameters.
The point of having the "outer" training function `f` is that after you did your cross validation run, it gives you a straightforward way to train "on the whole data set": just use `f(whole data set)` instead of the call `f(cv split training data)` for the cross validation surrogate models.
Thus in your example, you'll have 5+1 calls to `f`, and each of the calls to `f` will have e.g. 100 * 5 calls to `g`.
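To make this concrete, a minimal sketch of what such an `f` could look like (my illustration, assuming scikit-learn; the SVM and its grid are placeholders, and the role of `g` is played by fitting one model with fixed hyperparameters):

```python
# f takes only training data and returns a ready-to-use model; all
# hyperparameter tuning (the inner CV) happens inside it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

def f(X_train, y_train):
    """Train a ready-to-use model; all hyperparameter tuning happens inside."""
    grid = {"C": np.logspace(-2, 2, 10), "gamma": np.logspace(-3, 1, 10)}  # 100 candidates
    search = GridSearchCV(SVC(), grid, cv=5)   # ~100 * 5 calls to the "naked" fit g
    search.fit(X_train, y_train)
    return search.best_estimator_              # refit on all of X_train

X, y = make_classification(n_samples=300, random_state=0)

# 5 calls f(cv split training data): the outer surrogate models
outer = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = [f(X[tr], y[tr]).score(X[te], y[te]) for tr, te in outer.split(X)]
print("estimated performance of the modelling procedure:", np.mean(fold_scores))

# + 1 call f(whole data set): the model you actually use
final_model = f(X, y)
```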
Probability threshold
While you could do this with yet another cross validation, this is not necessary: it is just one more hyperparameter your ready-to-use model has and it can be estimated inside `f`.
What you need in order to fix the threshold is a heuristic that allows you to calculate it. There's a wide variety of heuristics that are suitable in different situations - from ROC plus a specification of how important it is to avoid false positives compared to false negatives, over a minimum acceptable sensitivity, specificity, PPV or NPV, to allowing two thresholds and thus an "uncertain" (NA) level, and so on. Good heuristics are usually very application specific.
But for the question here, you can do this inside `f`, e.g. by using the predictions obtained during the inner cross validation to calculate the ROC and then find your working point/threshold accordingly.
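A minimal sketch of one such heuristic (my illustration, assuming scikit-learn; Youden's J is only an example working point - your heuristic should reflect the application-specific costs of false positives vs. false negatives):

```python
# Inside f, out-of-fold predictions from an inner CV are used only to set the
# threshold; the final model is then fit on all training data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

def f(X_train, y_train):
    base = LogisticRegression(max_iter=1000)
    oof_probs = cross_val_predict(base, X_train, y_train, cv=5,
                                  method="predict_proba")[:, 1]
    fpr, tpr, thresholds = roc_curve(y_train, oof_probs)
    threshold = thresholds[np.argmax(tpr - fpr)]   # Youden's J working point
    model = base.fit(X_train, y_train)             # final fit on all training data
    return model, threshold

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)
model, threshold = f(X, y)
y_hat = (model.predict_proba(X)[:, 1] >= threshold).astype(int)
```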
Specific Comments to parts of the question
I understand that I shouldn't report the performance from the CV used to pick the optimal hyperparameters as an estimate of the expected performance of my final model (which would be overly-optimistic) but should instead include an outer CV loop to get this estimate.
Yes. (Though the inner estimate does carry information in relation to the outer estimate: if it is much more optimistic than the outer estimate, you are typically overfitting.)
I understand that the inner CV loop is used for model selection
Any kind of data-driven model tuning, really -> that includes tuning your cutoff-threshold.
(in this case, the optimal hyperparameters) and that the outer loop is used for model evaluation, i.e., the inner and outer CV serve two different purposes that often are erroneously conflated.
Yes.
That is, the hyperparameter tuning is part of "the method for building the model".
I prefer to see it this way as well: I'm a chemist and, like you, look at it from the application side: for me a trained/fitted model is not complete without the hyperparameters, or more precisely, a model is something I can use directly to obtain predictions.
Though as you note other people have a different view (without hyperparameter tuning). In my experience, this is often the case with people developing new models: hyperparameter tuning is then a "solved problem" and not considered. (side note: their view on what cross validation can do in terms of validation is also slightly different from what cross validation can do from the application side).
Best Answer
It sounds like you're taking the correct approach - you'll want to do a nested CV so that you tune your parameters on the inner dataset, and then estimate the error on a holdout set that the model has never seen before. As an example:
Divide your training set into 10 folds. Use 9 of the folds to tune your model (again through CV), and then estimate the error on the 10th fold that you held out. Repeat this 10 times, once per fold, to get an estimate of the error (and you could do another set of 10 runs if you randomly generate a different set of folds), as sketched below.
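A minimal sketch of that recipe (assuming scikit-learn; model and grid are placeholders):

```python
# GridSearchCV does the tuning CV on the 9 folds, while the (repeated) outer
# 10-fold CV estimates the error on the held-out fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

tuned = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=10)         # inner: tuning
outer = RepeatedKFold(n_splits=10, n_repeats=2, random_state=0)  # outer: new folds each repeat
scores = cross_val_score(tuned, X, y, cv=outer)                  # error on held-out folds
print(scores.mean(), scores.std())
```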
The Elements of Statistical Learning explicitly warns against "[Using] cross-validation to estimate the unknown tuning parameters and to estimate the prediction error of the final model." - emphasis mine (see Ch 7, Section 10.2). You can of course use CV to estimate these separately.
If you need another citation for the importance of this, some researchers at Google just released a paper related to this. If you have access to the paper in Science, they even released their Python code along with the article.