How do I choose a model from this [outer cross validation] output?
Short answer: You don't.
Treat the inner cross validation as part of the model fitting procedure. That means that the fitting, including the fitting of the hyper-parameters (this is where the inner cross validation hides), is just like any other model estimation routine.
The outer cross validation estimates the performance of this model fitting approach. For that you use the usual assumptions:
1. The $k$ outer surrogate models are equivalent to the "real" model built by `model.fitting.procedure` with all data.
2. Or, in case 1. breaks down (pessimistic bias of resampling validation), at least the $k$ outer surrogate models are equivalent to each other.
This allows you to pool (average) the test results. It also means that you do not need to choose among them as you assume that they are basically the same.
The breaking down of this second, weaker assumption is model instability.
Do not pick the seemingly best of the $k$ surrogate models - that would usually be just "harvesting" testing uncertainty and leads to an optimistic bias.
So how can I use nested CV for model selection?
The inner CV does the selection.
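To make that concrete, here is a minimal sklearn-flavoured sketch (the estimator and grid are illustrative assumptions, not something from the question):

```python
# Minimal sketch (illustrative estimator and grid, not from the question):
# the inner CV lives inside GridSearchCV; with refit=True the winning
# hyperparameters are used to retrain on all data passed to .fit(),
# and that retrained estimator is the "model" the procedure returns.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

model_fitting_procedure = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear", "rbf"], "C": [1, 10]},
    cv=5,        # inner cross validation: does the selection
    refit=True,  # then refits the winner on all data given to .fit()
)
model_fitting_procedure.fit(X, y)
print(model_fitting_procedure.best_params_)
```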
It looks to me that selecting the best model out of those K winning models would not be a fair comparison since each model was trained and tested on different parts of the dataset.
You are right in that it is not a good idea to pick one of the $k$ surrogate models. But you are wrong about the reason. Real reason: see above. The fact that they are not trained and tested on the same data does not "hurt" here.
- Not having the same testing data: as you want to claim afterwards that the test results generalize to never seen data, this cannot make a difference.
- Not having the same training data:
- if the models are stable, this doesn't make a difference: Stable here means that the model does not change (much) if the training data is "perturbed" by replacing a few cases by other cases.
- if the models are not stable, three considerations are important:
- you can actually measure whether and to which extent this is the case by using iterated/repeated $k$-fold cross validation (see the sketch after this list). That allows you to compare cross validation results for the same case that were predicted by different models built on slightly differing training data.
- If the models are not stable, the variance observed over the test results of the $k$-fold cross validation increases: you do not only have the variance due to the fact that only a finite number of cases is tested in total, but have additional variance due to the instability of the models (variance in the predictive abilities).
- If instability is a real problem, you cannot extrapolate well to the performance for the "real" model.
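A minimal sketch of such a stability check via iterated/repeated $k$-fold cross validation (the estimator and data are placeholders, not the original author's code):

```python
# Sketch of a stability check via iterated/repeated k-fold CV:
# each case is predicted by several surrogate models trained on slightly
# different data; disagreement across repetitions indicates instability.
import numpy as np
from collections import defaultdict
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)

predictions = defaultdict(list)  # case index -> predictions from different surrogate models
for train_idx, test_idx in cv.split(X, y):
    surrogate = SVC(kernel="linear", C=1).fit(X[train_idx], y[train_idx])
    for i, p in zip(test_idx, surrogate.predict(X[test_idx])):
        predictions[i].append(p)

# fraction of cases whose predicted class varies between repetitions
unstable_fraction = np.mean([len(set(p)) > 1 for p in predictions.values()])
print(f"cases with varying predictions: {unstable_fraction:.1%}")
```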
Which brings me to your last question:
What types of analysis /checks can I do with the scores that I get from the outer K folds?
- check for stability of the predictions (use iterated/repeated cross-validation)
- check for the stability/variation of the optimized hyper-parameters.
  For one thing, wildly scattering hyper-parameters may indicate that the inner optimization didn't work. For another thing, this may allow you to decide on the hyperparameters without the costly optimization step in similar situations in the future. By costly I do not refer to computational resources but to the fact that this "costs" information that may better be used for estimating the "normal" model parameters.
- check for the difference between the inner and outer estimate of the chosen model. If there is a large difference (the inner being very overoptimistic), there is a risk that the inner optimization didn't work well because of overfitting. (The last two checks are sketched below.)
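A sketch of those two checks, assuming a manually written outer loop around `GridSearchCV` (estimator, grid and fold counts are illustrative):

```python
# Record the winning hyperparameters per outer fold and compare the inner
# (potentially overoptimistic) estimate with the outer test score.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
outer = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)

for train_idx, test_idx in outer.split(X, y):
    search = GridSearchCV(
        SVC(), {"kernel": ["linear", "rbf"], "C": [1, 10]}, cv=5
    ).fit(X[train_idx], y[train_idx])
    inner_estimate = search.best_score_                      # inner CV estimate of the winner
    outer_estimate = search.score(X[test_idx], y[test_idx])  # independent outer estimate
    print(search.best_params_, f"inner: {inner_estimate:.3f}", f"outer: {outer_estimate:.3f}")
```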
Update regarding @user99889's question: What to do if the outer CV finds instability?
First of all, detecting in the outer CV loop that the models do not yield stable predictions in that respect doesn't really differ from detecting that the prediction error is too high for the application. It is one of the possible outcomes of model validation (or verification), implying that the model we have is not fit for its purpose.
In the comment answering @davips, I was thinking of tackling the instability in the inner CV - i.e. as part of the model optimization process.
But you are certainly right: if we change our model based on the findings of the outer CV, yet another round of independent testing of the changed model is necessary.
However, instability in the outer CV would also be a sign that the optimization wasn't set up well - so finding instability in the outer CV implies that the inner CV did not penalize instability in the necessary fashion - this would be my main point of critique in such a situation. In other words, why does the optimization allow/lead to heavily overfit models?
However, there is one peculiarity here that IMHO may excuse the further change of the "final" model after careful consideration of the exact circumstances: as we did detect overfitting, any proposed change to the model (fewer d.f., more restrictive fitting, or aggregation) would be in the direction of less overfitting (or at least towards hyperparameters that are less prone to overfitting). The point of independent testing is to detect overfitting - underfitting can be detected by data that was already used in the training process.
So if we are talking, say, about further reducing the number of latent variables in a PLS model, that would be comparatively benign (if the proposed change were a totally different type of model, say PLS instead of SVM, all bets would be off), and I'd be even more relaxed about it if I knew that we are anyway at an intermediate stage of modeling - after all, if the optimized models are still unstable, there's no question that more cases are needed. Also, in many situations, you'll eventually need to perform studies that are designed to properly test various aspects of performance (e.g. generalization to data acquired in the future).
Still, I'd insist that the full modeling process would need to be reported, and that the implications of these late changes would need to be carefully discussed.
Also, aggregation, including an out-of-bag-analogue CV estimate of performance, would be possible from the already available results - which is the other type of "post-processing" of the model that I'd be willing to consider benign here. Yet again, it would then have been better if the study had been designed from the beginning to check that aggregation provides no advantage over individual predictions (which is another way of saying that the individual models are stable).
Update (2019): the more I think about these situations, the more I come to favor the "nested cross validation apparently without nesting" approach.
(I'm sure I wrote most of this already in some answer - but can't find it right now. If anyone stumbles across that answer, please link it).
I see 2 slightly different approaches here, which I think are both sensible.
But first some terminology:
- Coming from an applied field, a (fitted/trained) model for me is ready to use. I.e., the model contains all information needed to generate predictions for new data. Thus, the model also contains the hyperparameters. As you will see, this point of view is closely related to approach 2 below.
- OTOH, "training algorithm" in my experience is not well defined in the following sense: in order to get the (fitted) model, not only does the - let's call it "primary fitting" - of the "normal" model parameters need to be done, but also the hyperparameters need to be fixed. From my application perspective, there isn't really much difference between parameters and hyperparameters: both are part of the model, and need to be estimated/decided during training.
I guess the difference between them is related to the perspective of someone developing new training algorithms, who'd usually describe a class of training algorithms together with some steering parameters (the hyperparameters) which are difficult/impossible to fix (or at least difficult to specify how they should be decided/estimated) without application/domain knowledge.
Approach 1: require stable optimization results
With this approach, "model training" is the fitting of the "normal" model parameters, and hyperparameters are given. An inner e.g. cross validation takes care of the hyperparameter optimization.
The crucial step/assumption here to solve the dilemma of whose hyperparameter set should be chosen is to require the optimization to be stable. Cross validation for validation purposes assumes that all surrogate models are sufficiently similar to the final model (obtained by the same training algorithm applied to the whole data set) to allow treating them as equal (among themselves as well as to the final model). If this assumption breaks down and
- the surrogate models are still equal (or equivalent) among themselves but not to the final model, we are talking about the well-known pessimistic bias of cross validation;
- the surrogate models are also not equal/equivalent to each other, we have problems with instability.
For the optimization results of the inner loop this means that if the optimization is stable, there is no conflict in choosing hyperparameters. And if considerable variation is observed across the inner cross validation results, the optimization is not stable. Unstable training situations have far worse problems than just the decision which of the hyperparameter sets to choose, and I'd really recommend stepping back in that case and starting the modeling process all over.
There's an exception here, though: there may be several local minima in the optimization yielding equal performance for practical purposes. Requiring the choice among them to also be stable may be an unnecessarily strong requirement - but I don't know how to get out of this dilemma.
Note that if not all models yield the same winning parameter set, you should not use outer loop estimates as generalization error here:
- If you claim generalization error for parameters $p$, all surrogate models entering into the validation should actually use exactly these parameters.
(Imagine someone told you they did a cross validation on a model with C = 1 and a linear kernel and you find out that some splits were evaluated with an rbf kernel!)
- But unless all splits yielded the same parameters (so that no decision is actually involved), choosing the winning set breaks independence in the outer loop: the test data of each split already entered the decision about which parameter set wins, because it was training data in all the other splits and was thus used to optimize the parameters.
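A trivial sketch of that consistency check (the parameter dictionaries are placeholders):

```python
# Only quote the pooled outer estimate as the generalization error *for
# parameter set p* if every outer fold actually selected p.
winning_params = [
    {"kernel": "linear", "C": 1},  # one dict per outer fold, e.g. collected
    {"kernel": "linear", "C": 1},  # in a loop like the earlier sketch
]
if all(p == winning_params[0] for p in winning_params):
    print("all outer folds agree -> outer estimate can be quoted for this parameter set")
else:
    print("outer folds disagree -> do not attach the pooled estimate to any single parameter set")
```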
Approach 2: treat hyperparameter tuning as part of the model training
This approach bridges the perspectives of the "training algorithm developer" and applied user of the training algorithm.
The training algorithm developer provides a "naked" training algorithm `model = train_naked(trainingdata, hyperparameters)`. The applied user instead needs `tunedmodel = train_tuned(trainingdata)`, which also takes care of fixing the hyperparameters. `train_tuned` can be implemented e.g. by wrapping a cross validation-based optimizer around the naked training algorithm `train_naked`.
`train_tuned` can then be used like any other training algorithm that does not require hyperparameter input, e.g. its output `tunedmodel` can be subjected to cross validation. Now the hyperparameters are checked for their stability just like the "normal" parameters should be checked for stability as part of the evaluation of the cross validation.
This is actually what you do and evaluate in the nested cross validation if you average performance of all winning models regardless of their individual parameter sets.
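In sklearn terms, a minimal sketch of approach 2 could look like this (estimator, grid and fold counts are illustrative; `train_naked`/`train_tuned` are the names used above):

```python
# train_naked(trainingdata, hyperparameters) ~ SVC(**hyperparameters).fit(trainingdata)
# train_tuned(trainingdata)                  ~ the GridSearchCV wrapper below
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

train_tuned = GridSearchCV(
    SVC(), {"kernel": ["linear", "rbf"], "C": [1, 10]}, cv=5  # inner CV hidden inside
)

# the outer CV treats train_tuned like any other training algorithm
outer_scores = cross_val_score(train_tuned, X, y, cv=7)
print(outer_scores.mean(), outer_scores.std())
```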
What's the difference?
We possibly end up with different final models taking those 2 approaches:
- the final model in approach 1 will be `train_naked(all data, hyperparameters from optimization)`,
- whereas approach 2 will use `train_tuned(all data)` and - as that runs the hyperparameter optimization again on the larger data set - this may end up with a different set of hyperparameters.
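For illustration, a sketch of the two resulting final models (estimator, grid and the "hyperparameters from optimization" are assumed placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# approach 1: hyperparameters were fixed by the earlier optimization,
# only the "normal" parameters are refitted on all data
hyperparameters_from_optimization = {"kernel": "linear", "C": 1}  # hypothetical result
final_model_1 = SVC(**hyperparameters_from_optimization).fit(X, y)

# approach 2: the tuning itself is rerun on all data, so the winning
# hyperparameters may differ from those seen in the surrogate models
final_model_2 = GridSearchCV(
    SVC(), {"kernel": ["linear", "rbf"], "C": [1, 10]}, cv=5
).fit(X, y)
print(final_model_2.best_params_)
```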
But again the same logic applies: if we find that the final model has substantially different parameters from the cross validation surrogate models, that's a symptom of assumption 1 being violated. So IMHO, again we do not have a conflict but rather a check on whether our (implicit) assumptions are justified. And if they aren't, we anyways should not bet too much on having a good estimate of the performance of that final model.
I have the impression (also from seeing the number of similar questions/confusions here on CV) that many people think of nested cross validation doing approach 1.
But generalization error is usually estimated according to approach 2, so that's the way to go for the final model as well.
Iris example
Summary: The optimization is basically pointless. The available sample size does not allow distinctions between the performance of any of the parameter sets here.
From the application point of view, however, the conclusion is that it doesn't matter which of the 4 parameter sets you choose - which isn't all that bad news: you found a comparatively stable plateau of parameters. Here comes the advantage of the proper nested validation of the tuned model: while you're not able to claim that it is the optimal model, you're still able to claim that the model built on the whole data using approach 2 will have about 97 % accuracy (95 % confidence interval for 145 correct out of 150 test cases: 92 - 99 %).
Note that also approach 1 isn't as far off as it seems - see below: your optimization accidentally missed a comparatively clear "winner" because of ties (that's actually another very telltale symptom of the sample size problem).
While I'm not deep enough into SVMs to "see" that C = 1 should be a good choice here, I'd go with the more restrictive linear kernel. Also, as you did the optimization, there's nothing wrong with choosing the winning parameter set even if you are aware that all parameter sets lead to practically equal performance.
In future, however, consider whether your experience yields rough guesstimates of what performance you can expect and roughly what model would be a good choice. Then build that model (with manually fixed hyperparameters) and calculate a confidence interval for its performance. Use this to decide whether trying to optimize is sensible at all. (I may add that I'm mostly working with data where getting 10 more independent cases is not easy - if you are in a field with large independent sample sizes, things look much better for you)
long version:
As for the example results on the `iris` data set: `iris` has 150 cases, and an SVM with a grid of 2 x 2 parameters (2 kernels, 2 orders of magnitude for the penalty `C`) is considered.
The inner loop has splits of 129 (2x) and 132 (6x) cases.
The "best" parameter set is undecided between linear or rbf kernel, both with C = 1. However, the inner test accuracies are all (including the always loosing C = 10) within 94 - 98.5 % observed accuracy. The largest difference we have in one of the splits is 3 vs. 8 errors for rbf with C = 1 vs. 10.
There's no way this is a significant difference. I don't know how to extract the predictions for the individual cases in the CV, but even assuming that the 3 errors were shared and the C = 10 model made an additional 5 errors:
> table (rbf1, rbf10)
rbf10
rbf1 correct wrong
correct 124 5
wrong 0 3
> mcnemar.exact(rbf1, rbf10)
Exact McNemar test (with central confidence intervals)
data: rbf1 and rbf10
b = 5, c = 0, p-value = 0.0625
alternative hypothesis: true odds ratio is not equal to 1
Remember that there are 6 pairwise comparisons in the 2 x 2 grid, so we'd need to correct for multiple comparisons as well.
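As an aside on extracting the predictions for the individual cases: one way (not the original R workflow above) is sklearn's `cross_val_predict` plus McNemar's test from statsmodels; the parameter values below just mirror the example's grid:

```python
# Per-case CV predictions via cross_val_predict, then McNemar's exact test
# on the paired 2x2 table of correct/wrong predictions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from statsmodels.stats.contingency_tables import mcnemar

X, y = load_iris(return_X_y=True)

pred_c1 = cross_val_predict(SVC(kernel="rbf", C=1), X, y, cv=5)
pred_c10 = cross_val_predict(SVC(kernel="rbf", C=10), X, y, cv=5)
ok_c1, ok_c10 = pred_c1 == y, pred_c10 == y

table = [[np.sum(ok_c1 & ok_c10), np.sum(ok_c1 & ~ok_c10)],
         [np.sum(~ok_c1 & ok_c10), np.sum(~ok_c1 & ~ok_c10)]]
result = mcnemar(table, exact=True)  # analogue of mcnemar.exact() in R
print(result.statistic, result.pvalue)
```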
Approach 1
In 3 of the 4 outer splits where rbf "won" over the linear kernel, they actually had the same estimated accuracy (I guess min in case of ties returns the first suitable index).
Changing the grid to `params = {'kernel':['linear', 'rbf'], 'C':[1,10]}` yields
({'kernel': 'linear', 'C': 1}, 0.95238095238095233, 0.97674418604651159)
({'kernel': 'rbf', 'C': 1}, 0.95238095238095233, 0.98449612403100772)
({'kernel': 'linear', 'C': 1}, 1.0, 0.97727272727272729)
({'kernel': 'linear', 'C': 1}, 0.94444444444444442, 0.98484848484848486)
({'kernel': 'linear', 'C': 1}, 0.94444444444444442, 0.98484848484848486)
({'kernel': 'linear', 'C': 1}, 1.0, 0.98484848484848486)
({'kernel': 'linear', 'C': 1}, 1.0, 0.96212121212121215)
Approach 2:
Here, `clf` is your final model. With `random_state = 2`, rbf with `C = 1` wins:
In [310]: clf.grid_scores_
[...snip warning...]
Out[310]:
[mean: 0.97333, std: 0.00897, params: {'kernel': 'linear', 'C': 1},
mean: 0.98000, std: 0.02773, params: {'kernel': 'rbf', 'C': 1},
mean: 0.96000, std: 0.03202, params: {'kernel': 'linear', 'C': 10},
mean: 0.95333, std: 0.01791, params: {'kernel': 'rbf', 'C': 10}]
(happens about 1 in 5 times; 1 in 6 times `linear` and `rbf` with `C = 1` are tied on rank 1)
Best Answer
I figured out where my understanding was off, so I thought I'd answer my own question in case anyone else stumbles upon it.
To start, sklearn makes nested cross-validation deceptively easy. I read their example over and over but never got it until I looked at the extremely helpful pseudocode given in the answer to this question.
Briefly, this is what I had to do (which is almost a copy of the example scikit-learn gives):
In code, this is kind of how it looks:
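(A minimal sketch of that pattern; the classifier, parameter distributions, scoring and fold counts below are illustrative placeholders rather than the exact pipeline from my code.)

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# inner loop: randomized hyperparameter search, itself cross-validated
inner_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300),
                         "max_depth": randint(2, 10)},
    n_iter=10, cv=5, scoring="roc_auc", random_state=0,
)

# outer loop: estimates how well the whole tuned procedure generalizes
outer_scores = cross_val_score(inner_search, X, y, cv=5, scoring="roc_auc")
print(outer_scores)
```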
`cross_val_score` will split the data into a training/test set and do a randomized search on that training set, which itself splits into training/test sets, generates the scores, and then goes back up to `cross_val_score` to test on the held-out part and move on to the next training/test split.
AFTER you do this, you'll get a bunch of cross-validation scores. My original question was: "what do you get/do now?" Nested cross-validation is not for model selection. What I mean by that, is that you're not trying to get parameter values that are good for your final model. That's what the inner RandomizedSearchCV is for.
But of course, if you are using something like a RandomForest for feature selection in your pipeline, then you'd expect a different set of parameters each time! So what do you really get that's useful?
Nested cross-validation is to give an unbiased estimate as to how good your methodology/series of steps is. What is "good"? Good is defined by the stability of hyperparameters and the cross-validation scores you ultimately get. Say you get numbers like I did: I got cross-validation scores of: [0.57027027, 0.48918919, 0.37297297, 0.74444444, 0.53703704]. So depending on the mood of my method of doing things, I can get an ROC score between 0.37 and 0.74 — obviously this is undesirable. If you were to look at my hyper-parameters, you'd see that the "optimal" hyper-parameters vary wildly. Whereas if I got consistent cross-validation scores that were high, and the optimal hyper-parameters were all in the same ballpark, I can be fairly confident that the way I am choosing to select features and model my data is pretty good.
If you have instability, I am not sure what you can do. I'm still new to this; the gurus on this board probably have better advice than blindly changing your methodology.
But if you have stability, what's next? This is another important aspect that I neglected to understand: a really good and predictive and generalizable model created by your training data is NOT the final model. But it's close. The final model uses all of your data, because you're done testing and optimizing and tweaking (yeah, if you'd try and cross-validate a model with data you used to fit it, you'd get a biased result, but why would you cross-validate it at this point? You've already done that, and hopefully a bias issue doesn't exist)—you give it all the data you can so it can make the most informed decisions it can, and the next time you'll see how well your model does, is when it's in the wild, using data that neither you nor the model has ever seen before.
I hope this helps someone. For some reason it took me a really long time to wrap my head around this, and here are some other links I used to understand:
http://www.pnas.org/content/99/10/6562.full.pdf — A paper that re-examines data and conclusions drawn by other genetics papers that don't use nested cross-validation for feature selection/hyper-parameter selection. It's somewhat comforting to know that even super smart and accomplished people also get swindled by statistics from time to time.
http://jmlr.org/papers/volume11/cawley10a/cawley10a.pdf — iirc, I've seen an author of this paper answer a ton of questions about this topic on this forum
Training with the full dataset after cross-validation? — One of the aforementioned authors answering a similar question in a more colloquial manner.
http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html — the sklearn example