So, if you had 8 training set samples would this scheme result in choose(8,2) = 28
resamples? Also, I'm assuming that this isn't two nested leave-one-out loops.
If so, here is a solution that might break down with large sample sizes:
num_samps <- 8
holdout <- combn(num_samps, 2)
in_training <- apply(holdout, 2,
                     function(x, all) all[!(all %in% x)],
                     all = 1:num_samps)

## need a more efficient way of doing this:
index <- vector(mode = "list", length = ncol(in_training))
for (i in 1:ncol(in_training)) index[[i]] <- in_training[, i]

## cosmetic:
names(index) <- caret:::prettySeq(seq(along = index))

ctrl <- trainControl(method = "cv", ## this will be ignored since we
                                    ## supply index below
                     index = index)
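As a quick sanity check (a base-R sketch, independent of caret; `in_training` is recreated here so the snippet stands alone), the scheme does produce choose(8, 2) = 28 resamples, each holding out a distinct pair:

```r
## recreate the pieces above with base R only
num_samps <- 8
holdout <- combn(num_samps, 2)  # one column per held-out pair
in_training <- apply(holdout, 2, function(x) setdiff(1:num_samps, x))

stopifnot(ncol(in_training) == choose(num_samps, 2))  # 28 resamples
stopifnot(nrow(in_training) == num_samps - 2)         # 6 analysis samples each
stopifnot(anyDuplicated(t(holdout)) == 0)             # every pair held out once
```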
Max
Part of the issue for #1 is terminology. We usually think of the training and test sets as the initial splitting that is done when you have assembled and cleaned your data. Resampling only happens on the training set; the test set is left for a final, unbiased evaluation of the model once you have singled one out as being the best.
When resampling, I have been using different terminology for the data used in the model and for the data held-out for immediate prediction. I call those the analysis and assessment sets respectively. So for simple 10-fold CV, each analysis set is 90% of the training set and the assessment set is 10%.
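A minimal base-R sketch of that split (the 100-row training set and the fold assignments here are hypothetical; no packages needed):

```r
## 10-fold CV on a hypothetical training set of 100 rows
set.seed(42)
n <- 100
fold_id <- sample(rep(1:10, length.out = n))  # assign each row to a fold

analysis_1   <- which(fold_id != 1)  # fit the model on these 90 rows
assessment_1 <- which(fold_id == 1)  # predict and score these 10 rows

stopifnot(length(analysis_1) == 90, length(assessment_1) == 10)
```

Each of the 10 folds takes a turn as the assessment set, so every training-set row is predicted exactly once.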
Your point about inefficient use of data with a training and test set is one complaint that I've heard over the years. However, it is good scientific practice to have a confirmatory data set that is only used to reaffirm the results that you obtained during the modeling process. There are ways to do resampling incorrectly, and you would not know that this has occurred until you evaluate the next set of samples (ones that were not involved in the preceding analysis). Your point is valid but, unless your entire data set is pathologically small, the value of a test set far outweighs the inefficiency caused by the smaller training set.
For #2, the only way to really know when you are overfitting is with a separate data set (such as the assessment set). Whether that comes from nested or non-nested resampling (please don't call it the caret method), using the model to predict other samples is the only way to tell.
For #3, the process that I generally give to people is to do an initial training/test split, then resample the training set (using the same analysis/assessment splits across all testing). I generally use non-nested resampling (I'm the one who wrote caret) but nested resampling can be used too (more on that below). Executing the resampling process across different tuning parameters can be very effective at helping choose parameter values, since overfitting is reflected in those statistics. Once you've settled on parameter values, the final model is refit on the entire training set.
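A sketch of that workflow in caret (this assumes the caret package is installed; the knn model, the iris data, and the tuning grid are purely illustrative choices):

```r
library(caret)

set.seed(101)
## 1. initial training/test split
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[ in_train, ]
testing  <- iris[-in_train, ]

## 2. resample the training set across a small tuning grid
ctrl <- trainControl(method = "cv", number = 10)
fit  <- train(Species ~ ., data = training, method = "knn",
              tuneGrid = data.frame(k = c(3, 5, 7)),
              trControl = ctrl)

## 3. train() refits the best k on the full training set; the test set
##    is touched only once, for the final confirmatory estimate
confusionMatrix(predict(fit, testing), testing$Species)
```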
Think of the process like this: the model-related operations are a module, and this module can be applied to any data set. Resampling is a method of estimating the performance of that module and was invented to emulate what the results would be for the module fit on the entire training set. Even though resampling sometimes uses less data when the module is repeatedly evaluated, it is still a good estimator of the final model that uses all the training data.
The documentation for the rsample package shows this at a more nuts-and-bolts level. For example, this page shows a neural network being tuned across epochs using simple 10-fold CV. In that example, you can see that the assessment sets (which would capture the effect of overfitting) are used to measure performance.
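For reference, a minimal version of that pattern with rsample (assuming the package is installed; mtcars is just a stand-in data set):

```r
library(rsample)

set.seed(7)
folds <- vfold_cv(mtcars, v = 10)

## analysis() and assessment() extract the two portions of each split;
## performance would be measured only on the assessment rows
first_split <- folds$splits[[1]]
nrow(analysis(first_split))    # roughly 90% of the rows
nrow(assessment(first_split))  # roughly 10% of the rows
```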
About nesting versus non-nesting: the main worry in non-nested resampling is optimization bias. If we evaluate a large number of tuning parameter values, there is some bias that we get by just choosing the best value. We are likely to be optimistic in our performance estimate. That is a real pattern and it is shown nicely in the papers that discuss it.
However... my experience is that, although real, this bias is very small in most cases (especially when compared to the experimental noise). I have yet to see a real data set where non-nested resampling gave pathologically optimistic estimates. This vignette has a simulated case study using rsample that is a good demonstration. If the cost of nested resampling were not so high, I would definitely be using it more often.
Best Answer
Resampling isn't the same as tuning: it gives you an estimate of a performance metric. When you tune over a hyperparameter subspace, you perform your resampling strategy at each evaluation point of that subspace.
So, in caret, when you resample a linear regression without tuning, you get an estimate of the performance metric of your choice, and the regression coefficients are estimated on the analysis data of each resampling instance.
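A small sketch of that, assuming caret is installed (mtcars and the predictors here are just stand-ins):

```r
library(caret)

set.seed(13)
ctrl <- trainControl(method = "cv", number = 10)
fit  <- train(mpg ~ wt + hp, data = mtcars, method = "lm",
              trControl = ctrl)

fit$resample          # RMSE/R-squared from each held-out fold
fit$results           # the aggregated resampling estimate
coef(fit$finalModel)  # coefficients refit on the entire training data
```

Note that `fit$finalModel` is the single regression refit on all the training data; the per-fold coefficient estimates are internal to the resampling loop.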