My current understanding is that the canonical repeated k-fold cross-validation (CV) process might do the following if $n=100$ observations in sample, $k=5$ folds, $i=10$ iterations (see iteration 1 here), our $model=$ linear regression, and our interest is prediction:
- Validation using $n=100$:
- (i1) a random $n_{train}=80$ (i.e. $(k-1)/k \cdot 100 = 4/5 \cdot 100$) is used by OLS to estimate the $p$ parameters; the fit is then validated on the other $n_{validate}=20$;
- For k1,.. k5: OLS gives $\hat\beta_{x_1},\dots,\hat\beta_{x_p}$, $R^2_{train}$ on $n_{train}=80$, and $R^2_{validate}$ on $n_{validate}=20$.
- But what does CV do with this "folds" matrix of $k=5$ rows?
- repeat (i2,.. i10)
- For i1,.. i10: OLS gives $\hat\beta_{x_1},\dots,\hat\beta_{x_p}$, $R^2_{train}$ on $n_{train}=80$, and $R^2_{validate}$ on $n_{validate}=20$.
- *But what does CV do with this "folds * iterations" matrix of $k*i=50$ rows?*
- Prediction:
- Which parameter estimates, from our 50 rows, does CV use to make $\hat y$ on our $n=100$ observations in sample?
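The "folds * iterations" matrix above can be made concrete. Here is a minimal pure-Python sketch, using synthetic single-predictor data and hypothetical helper names (`fit_ols_1d`, `r2` are illustrative, not from any library), that builds the $k \cdot i = 50$ rows of per-fold validation $R^2$:

```python
import random
import statistics

def fit_ols_1d(xs, ys):
    """Closed-form OLS for y = a + b*x (one predictor, for brevity)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def r2(a, b, xs, ys):
    """R^2 of the fitted line (a, b) on the given cases."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

random.seed(0)
n, k, iters = 100, 5, 10
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [2 + 3 * x + random.gauss(0, 1) for x in xs]

rows = []  # the "folds * iterations" matrix: k*i = 50 rows
for it in range(iters):
    idx = list(range(n))
    random.shuffle(idx)
    for f in range(k):
        vs = set(idx[f::k])                       # n_validate = 20
        train = [j for j in idx if j not in vs]   # n_train = 80
        a, b = fit_ols_1d([xs[j] for j in train], [ys[j] for j in train])
        rows.append((it, f, r2(a, b, [xs[j] for j in vs], [ys[j] for j in vs])))

print(len(rows))                                   # 50
print(statistics.mean(r for _, _, r in rows))      # the averaged CV estimate
```

Each row holds one surrogate model's validation score; what CV does with these rows is the subject of the answer below.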
Part 2:
The caret package in R allows the following if $(n,k,i,model)$ are the same as above and $train= 0.75$:
- training: $n_{train}=$ random 75
- test: $n_{test}=$ other 25
- Validation using $n_{train}=$ random 75:
- Perform the canonical process above for $i$ iterations on the random $n_{train}=75$ (thus 4/5 gives $n_{train}=60$ per fold, and 1/5 gives $n_{validate}=15$);
- Create matrix as above and do something with this matrix.
- But what does CV then do with $n_{test}=25$ that remains unused?
*Note we could replace $R^2$ with $RMSE$ for default caret behavior on regressions.
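One common pattern for such a setup (an assumption for illustration, not necessarily what caret does with the hold-out by default) is: repeated CV on the 75 training cases to estimate performance, a final refit on all 75, and the untouched 25 used once as a test set. A sketch with synthetic single-predictor data and hypothetical helpers:

```python
import random
import statistics

def fit_ols_1d(pairs):
    """Closed-form OLS for y = a + b*x (one predictor, for brevity)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    b = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
    return my - b * mx, b

def rmse(model, pairs):
    a, b = model
    return (sum((y - (a + b * x)) ** 2 for x, y in pairs) / len(pairs)) ** 0.5

random.seed(1)
data = []
for _ in range(100):
    x = random.gauss(0, 1)
    data.append((x, 2 + 3 * x + random.gauss(0, 1)))
random.shuffle(data)
train, test = data[:75], data[75:]   # outer hold-out split (train = 0.75)

# inner repeated 5-fold CV on the 75 training cases (n_train=60, n_validate=15)
k, iters, scores = 5, 10, []
for _ in range(iters):
    random.shuffle(train)
    for f in range(k):
        val = train[f::k]
        rest = [p for g in range(k) if g != f for p in train[g::k]]
        scores.append(rmse(fit_ols_1d(rest), val))

final_model = fit_ols_1d(train)      # refit on all 75 training cases
print(statistics.mean(scores))       # inner-CV RMSE estimate
print(rmse(final_model, test))       # one final check on the untouched 25
```

RMSE is used here to match caret's default regression metric, per the note above.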
Best Answer
Typically, cross validation uses the average performance (unweighted) as its result. Cross validation in itself does not do any selection: one main idea behind cross validation is to reduce variance by averaging over more tests, and selecting would do the opposite, leading to increased variance.
So no particular model or iteration is chosen; the $i \cdot k$ models built during the cross validation process are just seen as surrogates for the "real" model, which is trained separately on all $n$ cases. The $i \cdot k$ sets of model parameters are usually discarded.
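In other words, the 50 surrogate scores are simply averaged into one number, which is then reported as the performance estimate for the one model refit on all $n$. A tiny sketch (the surrogate values here are placeholder random numbers standing in for a real CV run):

```python
import random
import statistics

random.seed(2)
# placeholder: pretend these are the 50 surrogate-model validation R^2 values
surrogate_r2 = [random.uniform(0.85, 0.95) for _ in range(50)]

# the CV result is simply their unweighted average ...
cv_estimate = statistics.mean(surrogate_r2)

# ... reported as the performance estimate for the one "real" model,
# which is refit on all n cases; the 50 parameter sets are discarded
print(round(cv_estimate, 3))
```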
It is possible to make use of the parameter sets, though:
One of the implicit assumptions of cross validation is that because the training sets are very similar to each other and to the whole data set (differing only by $\frac{n}{k}$ to $\frac{2 n}{k}$ out of the $n$ cases), if the model building process is stable, then so should the parameters be*.
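That stability assumption can be checked directly: collect the coefficients of the surrogate models and look at their spread. A sketch with synthetic single-predictor data (the helper `fit_ols_1d` is illustrative):

```python
import random
import statistics

def fit_ols_1d(xs, ys):
    """Closed-form OLS for y = a + b*x (one predictor, for brevity)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

random.seed(3)
n, k = 100, 5
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [2 + 3 * x + random.gauss(0, 1) for x in xs]

idx = list(range(n))
random.shuffle(idx)
slopes = []
for f in range(k):
    vs = set(idx[f::k])
    train = [j for j in range(n) if j not in vs]
    _, b = fit_ols_1d([xs[j] for j in train], [ys[j] for j in train])
    slopes.append(b)

# if the model building process is stable, the surrogate slopes agree closely
print(statistics.stdev(slopes))  # small relative to the mean slope
```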
Note that cross validation procedures used for model parameter estimation rather than for estimating predictive ability (= testing, validation) are known as jackknifing (in the narrow sense, that would be leave-one-out resampling for parameter estimation).
Some confusion may come from the fact that the cross validation results can then be used for two different things: estimating predictive ability (testing, validation) and estimating the model parameters (jackknifing, as above).
*For some applications and models, the situation may be more difficult, as collinearities can lead to instability in the coefficients that does not necessarily affect the stability of the predictions.
update: answering questions in the comment
In the example above, you'd get 50 estimates of $R^2$, one for each of the 50 surrogate models, yes. And yes, they are assumed to be a good approximation for the $R^2$ of "the" (one) model built using the whole population.
Side note: so far, I've seen $R^2$ used for goodness of fit only, i.e. calculated explicitly on training data only. Doing this inside a cross validation does yield information, but possibly not the information you're after.
You could construct a predictive $R^2$, but typically I've seen residual sum of squares for out of training cases (e.g. $PRESS_{CV}$) instead of % unexplained variance for prediction.
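Such a predictive $R^2$ can be sketched from PRESS: each case is predicted by a model fit without it (leave-one-out), and the resulting sum of squared prediction errors is compared to the total sum of squares. A minimal pure-Python sketch with synthetic data (the helper name is illustrative; the quantity is often called $Q^2$):

```python
import random

def fit_ols_1d(xs, ys):
    """Closed-form OLS for y = a + b*x (one predictor, for brevity)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

random.seed(4)
n = 30
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [2 + 3 * x + random.gauss(0, 1) for x in xs]

# PRESS: every case is predicted by a model fit without it (leave-one-out)
press = 0.0
for i in range(n):
    a, b = fit_ols_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    press += (ys[i] - (a + b * xs[i])) ** 2

my = sum(ys) / n
tss = sum((y - my) ** 2 for y in ys)
q2 = 1 - press / tss  # a "predictive R^2"
print(round(q2, 3))
```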
caret: I cannot say for sure - I don't use caret, as I have hierarchical data structures and need to take some rather special care in splitting training and test data. It could be that $n_{test}$ is set aside for nested validation, or that it is a parameter that is used if hold-out validation is done instead of cross validation. (In the page you linked, I did not see an explicit explanation on a quick glance.) Looking up the source code would settle it.
jackknife vs. cross validation: no, there is no difference in the calculations -- it just happened to end up with a second name.