Part of the issue for #1 is terminology. We usually think of the training and test sets as the initial splitting that is done when you have assembled and cleaned your data. Resampling only happens on the training set; the test set is left for a final, unbiased evaluation of the model once you have singled one out as being the best.
When resampling, I have been using different terminology for the data used in the model and for the data held-out for immediate prediction. I call those the analysis and assessment sets respectively. So for simple 10-fold CV, each analysis set is 90% of the training set and the assessment set is 10%.
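To make the terminology concrete, here is a minimal sketch in Python (`vfold_splits` is a made-up helper for illustration, not part of any package) of what the analysis/assessment splits look like for 10-fold CV:

```python
import random

def vfold_splits(n, v=10, seed=1):
    """Partition row indices 0..n-1 into v folds and yield
    (analysis, assessment) index pairs: each assessment set is one
    fold (~1/v of the training set); the analysis set is the rest."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::v] for i in range(v)]
    for fold in folds:
        held_out = set(fold)
        analysis = [j for j in idx if j not in held_out]
        yield analysis, fold

# For a training set of 100 rows, each analysis set has 90 rows
# and each assessment set has 10.
for analysis, assessment in vfold_splits(100):
    assert len(analysis) == 90 and len(assessment) == 10
```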
Your point about inefficient use of data with a training and test set is one complaint that I've heard over the years. However, it is good scientific practice to have a confirmatory data set that is only used to reaffirm the results obtained during the modeling process. There are ways to do resampling incorrectly, and you would not know that this has occurred until you evaluate the next set of samples (ones that were not involved in the preceding analysis). Your point is valid but, unless your entire data set is pathologically small, the benefit of a test set far outweighs the inefficiency caused by the smaller training set.
For #2, the only way to really know when you are overfitting is with a separate data set (such as the assessment set). Whether that comes from nested resampling or non-nested resampling (please don't call it the "caret method"), using the model to predict other samples is the only way to tell.
For #3, the process that I generally give to people is to do an initial training/test split, then resample the training set (using the same analysis/assessment splits across everything being evaluated). I generally use non-nested resampling (I'm the one who wrote caret) but nested resampling can be used too (more on that below). Executing the resampling process across different tuning parameter values can be very effective at helping choose them, since overfitting is reflected in those statistics. Once you've settled on parameter values, the final model is refit on the entire training set.
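Here is a toy end-to-end sketch of that process in pure Python. The nearest-neighbor model, the candidate `k` values, and all numbers are invented for illustration; this is not caret's implementation, just the shape of the workflow:

```python
import random

def knn_predict(train, x, k):
    """Mean response of the k nearest neighbors (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def cv_mse(train, k, v=10, seed=2):
    """10-fold CV estimate of MSE for one candidate k.  The fixed seed
    means every candidate is judged on the same analysis/assessment splits."""
    idx = list(range(len(train)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::v] for i in range(v)]
    sq_errs = []
    for fold in folds:
        held_out = set(fold)
        analysis = [train[i] for i in range(len(train)) if i not in held_out]
        assessment = [train[i] for i in fold]
        sq_errs += [(knn_predict(analysis, x, k) - y) ** 2
                    for x, y in assessment]
    return sum(sq_errs) / len(sq_errs)

rng = random.Random(0)
data = [(x / 50, 2 * x / 50 + rng.gauss(0, 0.3)) for x in range(100)]
rng.shuffle(data)
train, test = data[:80], data[80:]          # initial training/test split

# Tune k by resampling the training set only; the test set is untouched.
best_k = min([1, 3, 5, 7], key=lambda k: cv_mse(train, k))

# Refit on the entire training set with the chosen k, then do the one
# final, unbiased evaluation on the test set.
test_mse = sum((knn_predict(train, x, best_k) - y) ** 2
               for x, y in test) / len(test)
```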
Think of the process like this: the model-related operations are a module, and this module can be applied to any data set. Resampling is a method of estimating the performance of that module; it was invented to emulate what the results would be if the module were fit on the entire training set. Even though resampling sometimes uses less data when the module is repeatedly evaluated, it is still a good estimator of the final model that uses all the training data.
The documentation for the rsample package shows this at a more nuts-and-bolts level. For example, this page shows a neural network being tuned across epochs using simple 10-fold CV. In that example, you can see that the assessment sets (which would capture the effect of overfitting) are used to measure performance.
About nesting versus non-nesting: the main worry in non-nested resampling is optimization bias. If we evaluate a large number of tuning parameter values, there is some bias that we get by just choosing the best value. We are likely to be optimistic in our performance estimate. That is a real pattern and it is shown nicely in the papers that discuss it.
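The optimization bias is easy to demonstrate with a small simulation (all numbers here are invented for illustration): give many candidate tuning values the same true performance, add resampling noise to each estimate, and look at the estimate attached to the value you pick as "best".

```python
import random

rng = random.Random(0)
true_acc, noise_sd = 0.70, 0.03     # assumed true accuracy and resampling noise
n_sim, n_params = 500, 20           # simulations and candidate tuning values

bias = []
for _ in range(n_sim):
    # Every candidate has the same true accuracy; the estimates differ
    # only by resampling noise.
    estimates = [true_acc + rng.gauss(0, noise_sd) for _ in range(n_params)]
    bias.append(max(estimates) - true_acc)   # "best" estimate vs. truth

mean_bias = sum(bias) / n_sim
# Picking the maximum of many noisy estimates is optimistic on average,
# even though no candidate is actually better than any other.
```

Nested resampling avoids this by estimating the performance of the whole tuning procedure on data held out from the tuning itself.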
However... my experience is that, although real, this bias is very small in most cases (especially when compared to the experimental noise). I have yet to see a real data set where non-nested resampling gave pathologically optimistic estimates. This vignette has a simulated case study using rsample that is a good demonstration. If the cost of nested resampling were not so high, I would definitely use it more often.
Best Answer
Short answer: it is neither wrong nor new.
We discussed this validation scheme under the name "set validation" about 15 years ago when preparing a paper*, but in the end never actually referred to it there, as we didn't find it used in practice.
Wikipedia refers to the same validation scheme as repeated random sub-sampling validation or Monte Carlo cross validation.
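In code, the scheme amounts to drawing a fresh random train/test partition on every iteration. A sketch in Python (`monte_carlo_splits` is a made-up name, and the 25 iterations and 1:3 split ratio are arbitrary choices):

```python
import random

def monte_carlo_splits(n, n_iter=25, test_frac=0.25, seed=3):
    """Repeated random sub-sampling ("Monte Carlo CV"): each iteration
    shuffles the row indices and splits them into disjoint train and
    test parts.  Unlike k-fold CV, the test sets of different
    iterations may overlap with one another."""
    rng = random.Random(seed)
    n_test = round(n * test_frac)
    idx = list(range(n))
    for _ in range(n_iter):
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]

for train_idx, test_idx in monte_carlo_splits(100):
    assert len(test_idx) == 25 and set(train_idx).isdisjoint(test_idx)
```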
From a theory point of view, the concept was of interest to us because
* Beleites, C.; Baumgartner, R.; Bowman, C.; Somorjai, R.; Steiner, G.; Salzer, R. & Sowa, M. G. Variance reduction in estimating classification error using sparse datasets, Chemom Intell Lab Syst, 79, 91 - 100 (2005).
The "set validation" error for N = 1 is hidden in fig. 6 (i.e. its bias + variance can be reconstructed from the data given, but they are not reported explicitly).
Well, in the paper above we found the total error (bias² + variance) of out-of-bootstrap and repeated/iterated $k$-fold cross validation to be pretty similar (oob had somewhat lower variance but higher bias, though we did not follow up to check how much of this trade-off is due to resampling with/without replacement and how much is due to the different split ratio of about 1 : 2 for oob).
Keep in mind, though, that I'm talking about accuracy in small-sample-size situations, where the dominating contributor to variance uncertainty is the limited number of true samples available for testing, and that is the same for oob, cross validation, and set validation. Iterations/repetitions allow you to reduce the variance caused by instability of the (surrogate) models, but not the uncertainty due to the limited total sample size.
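This can be seen in a small pure-Python simulation (all numbers are assumed for illustration): because the same n test samples are reused in every repetition, averaging over repetitions removes the model-instability component of the variance but not the finite-test-set floor of roughly p(1-p)/n.

```python
import random

def resampled_estimate(rng, n=50, reps=1, p=0.75, instability=0.10):
    """Accuracy estimate from one data set of n test samples.
    Each repetition fits an unstable surrogate model whose true accuracy
    jitters around p; the *same* n samples are reused in every repetition,
    so that part of the variance cannot be averaged away."""
    difficulty = [rng.random() for _ in range(n)]   # fixed per data set
    estimates = []
    for _ in range(reps):
        acc = p + rng.gauss(0, instability)         # surrogate-model jitter
        correct = sum(1 for d in difficulty if d < acc)
        estimates.append(correct / n)
    return sum(estimates) / reps

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)
var_single = variance([resampled_estimate(rng, reps=1) for _ in range(2000)])
var_repeated = variance([resampled_estimate(rng, reps=25) for _ in range(2000)])
# var_repeated comes out clearly smaller than var_single (the instability
# averaged out), but it stays above the finite-sample floor of about
# p * (1 - p) / n.
```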
Thus, assuming that you perform an adequately large number of iterations/repetitions N, I'd not expect practically relevant differences in the performance of these validation schemes.
One validation scheme may fit better with the scenario you try to simulate by the resampling, though.