The cross-validation method LeaveMOut is a common strategy. When building a classifier, LeaveMOut lets you create training and test sets easily: the procedure is repeated several times with random splits, and you average the performance over the repetitions.
In that sense, LeaveMOut resembles k-fold cross validation, where (k-1) folds are used for training and the remaining fold is used for testing.
There is no big difference. Perhaps the only one is that LeaveMOut does not set aside a separate validation set, since it only leaves M samples out of the training data. When it is said that running LMO in a loop does not guarantee disjoint evaluation sets, it means that the same sample may appear in the test set of several iterations; this can be problematic for some applications (though not in general, in my opinion).
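To make the disjointness point concrete, here is a minimal pure-Python sketch (the function names are mine, not from any library): repeated random leave-M-out draws can put the same sample in several test sets, whereas the k test folds of k-fold CV partition the data.

```python
import random

def leave_m_out_splits(n, m, n_repeats, seed=0):
    """Repeatedly draw a random test set of size m (leave-M-out style).
    Test sets across repeats may overlap."""
    rng = random.Random(seed)
    idx = list(range(n))
    for _ in range(n_repeats):
        test = set(rng.sample(idx, m))
        train = [i for i in idx if i not in test]
        yield train, sorted(test)

def kfold_splits(n, k):
    """k-fold CV: the k test folds are pairwise disjoint and cover the data."""
    idx = list(range(n))
    fold_size = n // k
    for f in range(k):
        test = idx[f * fold_size:(f + 1) * fold_size]
        train = idx[:f * fold_size] + idx[(f + 1) * fold_size:]
        yield train, test

# 4 draws of 3 samples out of 10: 12 test slots over at most 10 distinct
# samples, so some sample must appear in more than one test set.
tests_lmo = [t for _, t in leave_m_out_splits(10, 3, 4)]
# 5-fold CV on the same 10 samples: test folds are disjoint.
tests_kf = [t for _, t in kfold_splits(10, 5)]
```

Whether the overlap matters depends on the application; for a plain performance average it usually does not.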
Your understanding sounds good to me, with the possible exception that what you call a "run" is, in my field, called either a fold (as in 5-fold cross validation) when referring to the test data, or a "surrogate model" when referring to the model.
Yes, the outer folds can return different hyperparameter sets and/or parameters (coefficients).
This is valid in the sense that this is allowed to happen. It is invalid in the sense that this means the optimization (done with the help of the inner folds) is not stable, so you have not actually found "the" [global] optimum.
For the overall model, you're supposed to run the inner cross validation again on the whole data set. I.e., you optimize/auto-tune your hyperparameters on the training set (now the whole data set) just the same as you did during the outer cross validation.
Update: longer explanation
See also Nested cross validation for model selection and How to build the final model and tune probability threshold after nested cross-validation?
Are you saying that to get the "the [global] optimum" you need to run your entire dataset on all the combinations of c's, gamma's, kernels etc?
No. In my experience the problem is not that the search space is not explored in detail (all possible combinations) but rather that our measurement of the resulting model performance is subject to uncertainty.
Many optimization strategies coming from numerical optimization implicitly assume that there is negligible noise on the target functional. I.e. the functional is basically a smooth, continuous function of the hyperparameters. Depending on the figure of merit you optimize and the number of cases you have, this assumption may or may not be met.
If you do have considerable noise on the estimate of the figure of merit but do not take this into account (i.e. the "select the best one" strategy you mention), your observed "optimum" is subject to noise.
In addition, the noise (variance uncertainty) on the performance estimate increases with model complexity. In this situation, naively selecting the best observed performance can also lead to a bias towards overly complex models.
See e.g. Cawley, G. C. & Talbot, N. L. C.: On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, Journal of Machine Learning Research, 11, 2079-2107 (2010).
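The selection bias can be demonstrated with a small simulation (a toy sketch, not from the paper above): give many "models" the identical true accuracy, score each on a finite test set, and pick the best observed score. The winner looks better than the true accuracy purely by chance.

```python
import random

rng = random.Random(42)
true_acc = 0.75   # every candidate model has this same true accuracy
n_test = 100      # finite test set -> noisy accuracy estimates
n_models = 50     # e.g. a hyperparameter grid of 50 combinations

# Observed accuracy of each model: binomial noise around true_acc.
observed = [sum(rng.random() < true_acc for _ in range(n_test)) / n_test
            for _ in range(n_models)]

# "Select the best one": the winning score is optimistically biased,
# even though no model is actually better than any other.
best = max(observed)
```

The average of `observed` stays near 0.75, but the maximum over 50 noisy draws does not: that gap is exactly the overoptimism nested CV is meant to expose.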
How does this get incorporated into the nested cross validation procedure or the final results of the analysis?
Hastie, T., Tibshirani, R. and Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Verlag, New York, 2009, say in chapter 7.10:

Often a “one-standard error” rule is used with cross-validation, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model.
Which I find a good heuristic (I take the additional precaution of estimating both the variance uncertainty due to the limited number of cases and that due to model instability; the Elements of Statistical Learning do not discuss the latter in their cross validation chapter).
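The one-standard-error rule is easy to state in code. A minimal sketch with made-up per-fold errors (the numbers and the function name are purely illustrative), where model complexities are assumed ordered from simplest to most complex:

```python
import statistics

# Hypothetical per-fold CV errors for 4 models of increasing complexity.
fold_errors = {
    1: [0.30, 0.32, 0.28, 0.31, 0.29],   # simplest model
    2: [0.21, 0.23, 0.20, 0.22, 0.21],
    3: [0.20, 0.23, 0.19, 0.22, 0.21],   # best mean error
    4: [0.21, 0.24, 0.18, 0.23, 0.22],   # most complex model
}

def one_se_rule(fold_errors):
    """Pick the simplest model whose mean error is within one standard
    error of the best (lowest) mean error."""
    stats = {c: (statistics.mean(e),
                 statistics.stdev(e) / len(e) ** 0.5)  # standard error
             for c, e in fold_errors.items()}
    best_mean, best_se = min(stats.values(), key=lambda t: t[0])
    threshold = best_mean + best_se
    return min(c for c, (m, _) in stats.items() if m <= threshold)
```

Here model 3 has the lowest mean error, but model 2 is within one standard error of it, so the rule prefers the more parsimonious model 2.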
So your understanding:
I'm confused because my understanding is that you can't just run your analysis hundreds/thousands of times with different parameters/kernels and select the best one
is correct.
However, your understanding
(and nested CV is supposed to mitigate the associated issues).
may or may not be correct:
- nested CV does not make the hyperparameter optimization any more successful, but it can provide an honest estimate of the performance that can be achieved with that particular optimization strategy.
In other words: it guards against overoptimism about the achieved performance, but it does not improve this performance.
The final model:
- The outer split of the nested CV is basically an ordinary CV for validation/verification. It splits the available data set into training and testing subsets, and then builds a so-called surrogate model on the training set.
- During this training, you happen to do another (the inner) CV, whose performance estimates you use to fix/optimize the hyperparameters. But seen from the outer CV, this is just part of the model training.
The model training on the whole data set should do exactly what the model training inside the cross validation did. Otherwise the surrogate models and their performance estimates would not be good surrogates for the model trained on the whole data (and that is precisely the purpose of the surrogate models).
Thus: run the auto-tuning of hyperparameters on the whole data set just as you do during cross validation. Same hyperparameter combinations to consider, same strategy for selecting the optimum. In short: same training algorithm, just slightly different data (1/k additional cases).
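The whole structure can be sketched in a few lines of pure Python (all names here are mine; `tune_score` and `evaluate` are placeholder stand-ins for a real inner-CV tuning score and a real test-set evaluation):

```python
def kfold(idx, k):
    """Split an index list into k (train, test) pairs with disjoint test folds."""
    fold = len(idx) // k
    for f in range(k):
        test = idx[f * fold:(f + 1) * fold]
        train = idx[:f * fold] + idx[(f + 1) * fold:]
        yield train, test

def tune_score(train_idx, h):
    # Placeholder: a real version would run the inner CV on train_idx
    # with hyperparameter h. Here we pretend h=3 is optimal.
    return -abs(h - 3)

def evaluate(train_idx, test_idx, h):
    # Placeholder outer-fold performance estimate.
    return 1.0 - 0.1 * abs(h - 3)

def inner_select(train_idx, candidates):
    """Inner loop: hyperparameter auto-tuning, seen from outside
    as just part of the model training."""
    return max(candidates, key=lambda h: tune_score(train_idx, h))

def nested_cv(all_idx, k, candidates):
    """Outer loop: honest estimate of the whole training procedure,
    tuning included. The final model reruns the identical tuning
    on the full data set."""
    scores = []
    for train_idx, test_idx in kfold(all_idx, k):
        h = inner_select(train_idx, candidates)
        scores.append(evaluate(train_idx, test_idx, h))
    h_final = inner_select(all_idx, candidates)  # same algorithm, all data
    return scores, h_final

scores, h_final = nested_cv(list(range(20)), 5, [1, 2, 3, 4, 5])
```

Note that `inner_select` is called with exactly the same candidate set and selection strategy in both places; only the data differs.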
Best Answer
I think you need to provide more information, as there are many possible causes.
What you should do:
Is anything special about your data/classifier compared to other classification problems you have done?