Yes, the approach you are using is correct. As @halilpazarlama suggests, the reason your cross-validation error is lower than the test error is indeed likely to be that you are over-fitting the cross-validation error. Essentially the cross-validation error is a performance estimate based on a finite sample of data, and thus has a (usually) non-negligible variance (i.e. if you ran the same experiment again with a different sample of data, you would get a slightly different minimum error, and the minimum may well occur at a different grid point). Thus we can minimise the cross-validation error in two ways: in ways that genuinely improve generalisation performance, and in ways that merely exploit the particular random sample of data used to form the training set. I wrote a paper about this, as it can be problematic if you have a large number of hyper-parameters to tune:
G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010. (pdf)
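To see why the minimum over a grid is optimistic, here is a minimal sketch (my illustration, not from the paper): every candidate hyper-parameter setting is given the same true error of 0.25, and the cross-validation estimate is that truth plus sampling noise; taking the minimum over the grid still reports an error well below 0.25.

```r
# Selection bias sketch: 100 grid points with identical true error, noisy CV
# estimates, and the "best" grid point chosen by minimum CV error.
set.seed(1)
n_grid    <- 100    # hyper-parameter settings on the grid
n_repeats <- 1000   # independent repetitions of the whole experiment
true_err  <- 0.25   # true generalisation error of every candidate
cv_sd     <- 0.02   # standard deviation of the CV estimate

min_cv_err <- replicate(n_repeats, {
  cv_estimates <- true_err + rnorm(n_grid, sd = cv_sd)
  min(cv_estimates)      # error reported for the selected grid point
})

mean(min_cv_err)   # noticeably below 0.25: the minimum CV error is optimistic
```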
If you only have a few hyper-parameters to tune, this isn't usually too much of a problem, as long as you are aware that the minimum cross-validation error will be optimistically biased. If you need an unbiased (or at least less biased or pessimistically biased) estimate, then the thing to do is nested cross-validation, where the outer cross-validation estimates performance and the inner cross-validation is used to tune the hyper-parameters separately in each fold (see the paper for details). Basically tuning the hyper-parameters is part of the fitting of the model and needs to be cross-validated as well. Of course this is computationally even more expensive.
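As a rough sketch of what nested cross-validation looks like in practice (my own illustration, using caret, which comes up later in this thread, and an SVM on `iris`): the outer loop provides the performance estimate, while `train()` performs the inner cross-validation that tunes the hyper-parameters within each outer fold.

```r
# Nested cross-validation sketch: outer 5-fold CV for performance estimation,
# inner 10-fold CV (inside train()) for hyper-parameter tuning.
library(caret)   # also needs kernlab installed for method = "svmRadial"

set.seed(1)
outer_folds <- createFolds(iris$Species, k = 5, returnTrain = TRUE)

outer_acc <- sapply(outer_folds, function(train_idx) {
  inner_ctrl <- trainControl(method = "cv", number = 10)
  fit <- train(Species ~ ., data = iris[train_idx, ],
               method = "svmRadial", tuneLength = 5, trControl = inner_ctrl)
  held_out <- iris[-train_idx, ]
  mean(predict(fit, held_out) == held_out$Species)  # outer-fold accuracy
})

mean(outer_acc)   # (nearly) unbiased estimate of generalisation performance
```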
To reduce the computational expense, you could try using the Nelder-Mead simplex method for tuning the hyper-parameters instead of grid search; it is usually more efficient and doesn't need gradient information. Pattern search is another alternative. Another thing you can do to improve efficiency is to start the model for each grid point from the model found at the previous grid point, instead of starting from scratch again ("alpha seeding"). Alternatively you could use a "regularisation path" type algorithm to learn each row of the grid (where the regularisation parameter is changed with the kernel parameter held fixed) in one go.
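Here is a sketch of the Nelder-Mead idea (my own, assuming kernlab is installed): the objective is a cross-validated error rate over fixed folds, and `optim()` searches the hyper-parameters on the log scale instead of walking a grid.

```r
# Tuning an RBF-SVM's C and sigma with Nelder-Mead on a cross-validated error.
library(kernlab)

set.seed(1)
x <- as.matrix(iris[, 1:4])
y <- iris$Species
k <- 10
folds <- sample(rep(seq_len(k), length.out = nrow(x)))  # fixed folds so the
                                                        # objective is deterministic

cv_error <- function(log_params) {
  C     <- exp(log_params[1])   # search on the log scale
  sigma <- exp(log_params[2])
  errs <- sapply(seq_len(k), function(i) {
    fit <- ksvm(x[folds != i, ], y[folds != i], kernel = "rbfdot",
                C = C, kpar = list(sigma = sigma))
    mean(predict(fit, x[folds == i, ]) != y[folds == i])
  })
  mean(errs)
}

opt <- optim(par = c(0, 0), fn = cv_error, method = "Nelder-Mead")
exp(opt$par)   # tuned values of C and sigma
```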
Part of the issue for #1 is terminology. We usually think of the training and test sets as the initial splitting that is done when you have assembled and cleaned your data. Resampling only happens on the training set; the test set is left for a final, unbiased evaluation of the model once you have singled one out as being the best.
When resampling, I have been using different terminology for the data used in the model and for the data held-out for immediate prediction. I call those the analysis and assessment sets respectively. So for simple 10-fold CV, each analysis set is 90% of the training set and the assessment set is 10%.
Your point about inefficient use of data with a training and test set is one complaint that I've heard over the years. However, it is good scientific practice to have a confirmatory data set that is only used to reaffirm the results that you obtained during the modeling process. There are ways to do resampling incorrectly, and you would not know that this has occurred until you evaluate the next set of samples (that were not involved in the preceding analysis). Your point is valid but, unless your entire data set is pathologically small, the value of a test set far outweighs the inefficiency caused by the smaller training set.
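To make the terminology concrete, here is a small illustration with the `rsample` package (my addition): the initial split produces the training and test sets, and 10-fold CV on the training set produces the analysis/assessment pairs.

```r
# Training/test split plus analysis/assessment sets via rsample.
library(rsample)

set.seed(123)
split     <- initial_split(iris, prop = 0.8)   # initial training/test split
train_set <- training(split)
test_set  <- testing(split)                    # reserved for the final check

folds <- vfold_cv(train_set, v = 10)
nrow(analysis(folds$splits[[1]]))     # ~90% of the training set
nrow(assessment(folds$splits[[1]]))   # ~10% of the training set
```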
For #2, the only way to really know when you are overfitting is with a separate data set (such as the assessment set). Whether that comes from nested resampling or non-nested (please don't call it the `caret` method), using the model to predict other samples is the only way to tell.
For #3, the process that I generally give to people is to do an initial training/test split, then resample the training set (using the same analysis/assessment splits across all testing). I generally use non-nested resampling (I'm the one who wrote `caret`) but nested resampling can be used too (more on that below). Executing the resampling process across different tuning parameters can be very effective at helping choose parameter values since overfitting is reflected in those statistics. Once you've settled on parameter values, the final model is refit on the entire training set.
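A condensed sketch of that workflow with `caret` (my own example, using a CART model on `iris`): split, resample the training set across tuning values, and let `train()` refit the chosen model on the entire training set before the single confirmatory check on the test set.

```r
# Initial split, resampled tuning on the training set, final test-set check.
library(caret)   # method = "rpart" uses the rpart package

set.seed(42)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[in_train, ]
test_df  <- iris[-in_train, ]   # held back until the very end

ctrl <- trainControl(method = "cv", number = 10)
fit  <- train(Species ~ ., data = train_df,
              method = "rpart", tuneLength = 10, trControl = ctrl)

fit$bestTune                                              # chosen parameter value(s)
confusionMatrix(predict(fit, test_df), test_df$Species)  # confirmatory evaluation
```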
Think of the process like this: the model-related operations are a module and this module can be applied to any data set. Resampling is a method of estimating the performance of that module and was invented to emulate what the results would be for the module fit on the entire training set. Even though resampling sometimes uses less data when the module is repeatedly evaluated, it is still a good estimator of the final model that uses all the training data.
The documentation for the `rsample` package shows this at a more nuts-and-bolts level. For example, this page shows a neural network being tuned across epochs using simple 10-fold CV. In that example, you can see that the assessment sets (which would capture the effect of overfitting) are used to measure performance.
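A much simpler version of the same idea (my sketch, not the linked page): fit the model on each analysis set and score it on the corresponding assessment set, so any overfitting shows up in the held-out statistics rather than in the resubstitution error.

```r
# Resampled performance estimate for a single tuning value using rsample.
library(rsample)
library(rpart)

set.seed(1)
folds <- vfold_cv(iris, v = 10)

assess_acc <- sapply(folds$splits, function(s) {
  fit  <- rpart(Species ~ ., data = analysis(s), cp = 0.01)
  pred <- predict(fit, assessment(s), type = "class")
  mean(pred == assessment(s)$Species)   # accuracy on the assessment set
})

mean(assess_acc)   # resampled estimate for cp = 0.01
```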
About nesting versus non-nesting: the main worry in non-nested resampling is optimization bias. If we evaluate a large number of tuning parameter values, there is some bias that we get by just choosing the best value. We are likely to be optimistic in our performance estimate. That is a real pattern and it is shown nicely in the papers that discuss it.
However... my experience is that, although real, this bias is very small in most cases (especially when compared to the experimental noise). I have yet to see a real data set where non-nested resampling gave pathologically optimistic estimates. This vignette has a simulated case study using `rsample` that is a good demonstration. If the cost of nested resampling were not so high, I would definitely be using it more often.
Best Answer
Yes, you are correct. If you want to look at the details:

- `fit$results` with `fit$bestTune` and `fit$finalModel` (with the same performance, the less complex model is chosen).
- `fit$resample`. Note that by changing the value for `returnResamp` in `?trainControl` you can configure which results you see here (e.g. if you want to see those also for other than the finally selected parameter set) - but usually the default should be fine.
- Set `savePredictions = T` in `?trainControl`, then look at `fit$pred`, e.g. as `table(fit$pred$Resample)`.
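For completeness, a minimal sketch (mine, not part of the original answer) of how those slots look after a `caret` fit, with the `trainControl` options mentioned above switched on:

```r
# Inspecting the resampling details stored in a caret train object.
library(caret)

ctrl <- trainControl(method = "cv", number = 10,
                     savePredictions = TRUE, returnResamp = "all")
fit  <- train(Species ~ ., data = iris, method = "rpart",
              tuneLength = 5, trControl = ctrl)

fit$results               # mean resampled performance per tuning value
fit$bestTune              # the selected tuning parameter value(s)
fit$resample              # per-resample results
table(fit$pred$Resample)  # held-out predictions counted per fold
```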