Solved – Explain “validation” process of repeated k-fold cross-validation

caret, cross-validation, prediction, predictive-models

My current understanding is that the canonical repeated $k$-fold cross-validation (CV) process might do the following if we have $n = 100$ observations in sample, $k = 5$ folds, $i = 10$ iterations, our model is linear regression, and our interest is prediction (a base-R sketch of this process follows the list):

  • Validation using $n=100$:
    • (i1) a random $n_{train} = 80$ (i.e. $(k-1)/k \times 100 = 4/5 \times 100$) is used by OLS to estimate the $p$ parameters; the fit is then validated on the other $n_{validate} = 20$;
      • For $k_1, \dots, k_5$: OLS gives $\hat\beta_{x_1}, \dots, \hat\beta_{x_p}$, $R^2_{train}$ on $n_{train} = 80$, and $R^2_{validate}$ on $n_{validate} = 20$.
      • But what does CV do with this "folds" matrix of $k=5$ rows?
    • repeat (i2,.. i10)
      • For $i_1, \dots, i_{10}$: OLS gives $\hat\beta_{x_1}, \dots, \hat\beta_{x_p}$, $R^2_{train}$ on $n_{train} = 80$, and $R^2_{validate}$ on $n_{validate} = 20$.
      • *But what does CV do with this "folds $\times$ iterations" matrix of $k \times i = 50$ rows?*
  • Prediction:
    • Which parameter estimates, from our 50 rows, does CV use to make $\hat y$ on our $n=100$ observations in sample?
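
For concreteness, here is a minimal base-R sketch of the process just described (the data frame `sim_data` and its variables are invented purely for illustration); it produces exactly the $k \times i = 50$-row matrix the questions above refer to:

```r
# Minimal base-R sketch of the repeated 5-fold CV described above.
# `sim_data`, `x1`, `x2`, `y` are hypothetical names used only for illustration.
set.seed(1)
n <- 100; k <- 5; reps <- 10
sim_data <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))

results <- do.call(rbind, lapply(seq_len(reps), function(i) {
  folds <- sample(rep(seq_len(k), length.out = n))  # new random folds per iteration
  do.call(rbind, lapply(seq_len(k), function(f) {
    train <- sim_data[folds != f, ]                 # n_train = 80
    valid <- sim_data[folds == f, ]                 # n_validate = 20
    fit   <- lm(y ~ x1 + x2, data = train)          # OLS on the training folds
    pred  <- predict(fit, newdata = valid)
    data.frame(iteration = i, fold = f,
               R2_train    = summary(fit)$r.squared,
               R2_validate = 1 - sum((valid$y - pred)^2) /
                                 sum((valid$y - mean(valid$y))^2))
  }))
}))
nrow(results)  # 50: the "folds * iterations" matrix of k * i rows
```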

Part 2:
The caret package in R allows the following if $(n, k, i, model)$ are the same as above and $train = 0.75$ (a caret sketch follows after this list):

  • training: $n_{train}=$ random 75
  • test: $n_{test}=$ other 25
  • Validation using $n_{train}=$ random 75:
    • Perform the canonical process above for $i$ iterations on $n_{train} =$ random 75 (so within each iteration $n_{train} = 4/5 \times 75 = 60$ and $n_{validate} = 1/5 \times 75 = 15$);
    • Create a matrix as above and do something with this matrix.
  • But what does CV then do with $n_{test}=25$ that remains unused?

*Note: we could replace $R^2$ with RMSE to match caret's default behavior for regression.
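
A sketch of how this set-up typically looks with caret, assuming a hypothetical data frame `df` with outcome `y` (only `createDataPartition`, `trainControl`, `train`, and `predict` are actual caret calls; everything else is made up for illustration):

```r
# Hypothetical data frame `df` with outcome `y`; numbers follow the example above.
library(caret)

set.seed(1)
in_train <- createDataPartition(df$y, p = 0.75, list = FALSE)  # training: ~75 rows
training <- df[in_train, ]
testing  <- df[-in_train, ]                                    # test: ~25 rows, untouched by CV

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10)
fit  <- train(y ~ ., data = training, method = "lm", trControl = ctrl)

fit$resample                              # the k * i = 50 resampled RMSE / R^2 values
fit$results                               # their (unweighted) averages
pred <- predict(fit, newdata = testing)   # final model applied to the held-out ~25
```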

Best Answer

Typically, cross validation uses the average performance (not weighted) as the result. Cross validation in itself does not do any selection. One main idea behind cross validation is to reduce variance by averaging over more tests; selecting does the opposite and would lead to increased variance.
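
Continuing the hypothetical sketches from the question, the averaging is nothing more than this:

```r
# Unweighted mean over all 50 surrogate models, using `results` from the
# base-R sketch in the question; caret's fit$results reports the analogous
# averages of fit$resample.
mean(results$R2_validate)
mean(results$R2_train)
```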

So no particular model or iteration is chosen; the $i \cdot k$ models built during the cross validation process are just seen as surrogates for the "real" model, which is trained separately on all $n$ cases. The $i \cdot k$ sets of model parameters are usually discarded.
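
In code terms (again continuing the base-R sketch above), the 50 surrogate fits only supply the performance estimate; the parameters you keep come from one refit of the same procedure on all $n$ cases:

```r
# The "real" model: the same training procedure applied once to all n = 100 cases.
# Its coefficients are the ones you keep; the 50 surrogate fits are discarded.
real_model <- lm(y ~ x1 + x2, data = sim_data)
coef(real_model)
```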

It is possible to make use of the parameter sets, though:

  • to validate the model parameters in addition to the predictive ability, you can have a look at how stable the parameter estimates are.
    One of the implicit assumptions of cross validation is that, because the training sets are very similar to each other and to the whole data set (differing only by $\frac{n}{k}$ to $\frac{2n}{k}$ of the $n$ cases), if the model building process is stable, then so should the parameters be*.
  • if you find the models suffer from instability (particularly in the predictions), you may go one step further and use the cross validation surrogate models as an ensemble for an aggregated model, similar to what bagging does with bootstrap resampling with replacement (see the sketch after this list).
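
A sketch of both ideas, keeping the surrogate fits instead of discarding them (same hypothetical `sim_data` as in the question's sketch):

```r
# One repetition of 5-fold CV, retaining the surrogate models.
folds <- sample(rep(1:5, length.out = nrow(sim_data)))
surrogates <- lapply(1:5, function(f) lm(y ~ x1 + x2, data = sim_data[folds != f, ]))

# Stability check: spread of each coefficient across the surrogate models.
coefs <- t(sapply(surrogates, coef))
apply(coefs, 2, sd)

# Bagging-like ensemble: average the surrogates' predictions for new cases.
new_obs <- sim_data[1:3, ]
ensemble_pred <- rowMeans(sapply(surrogates, predict, newdata = new_obs))
```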

Note that cross validation procedures used for model parameter estimation rather than for estimating the predictive abilities (= testing, validation) are known as jackknifing (in the narrow sense that would be leave-one-out resampling for parameter estimation).

Some confusion may come from the fact that the cross validation results can then be used for two different things:

  • as performance estimate of the "real" model built on the whole data set using the same training procedure, or
  • to select among a number of different models, e.g. in hyperparameter estimation. In this case, the selected model must undergo an independent validation to exclude the possibility that the observed good performance was just due to chance (variance). For this set-up you therefore need either a so-called double or nested cross validation, or an independent test set that has not been used so far (a sketch of the nested set-up follows this list).
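
A compact sketch of the nested (double) cross validation mentioned in the second point: the inner loop does the selection, the outer loop gives the honest performance estimate. The data frame `df` and outcome `y` are hypothetical, and `glmnet` is used only as an example of a model with hyperparameters to tune:

```r
library(caret)

outer_folds <- createFolds(df$y, k = 5)           # held-out indices for the outer loop
outer_rmse <- sapply(outer_folds, function(idx) {
  inner_train <- df[-idx, ]
  outer_test  <- df[idx, ]
  # Inner CV: hyperparameter selection happens entirely inside this call.
  tuned <- train(y ~ ., data = inner_train, method = "glmnet",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneLength = 5)
  # Outer fold: evaluate the selected model on cases it has never seen.
  sqrt(mean((outer_test$y - predict(tuned, newdata = outer_test))^2))
})
mean(outer_rmse)  # nested-CV estimate of the whole selection + fitting procedure
```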

*For some applications and models, the situation may be more difficult, as collinearities can lead to instability in the coefficients that does not necessarily affect the stability of the predictions.


update: answering questions in the comment

In the example above, you'd get 50 estimates of $R^2$, one for each of the 50 surrogate models, yes. And yes, they are assumed to be a good approximation of the $R^2$ of "the" (one) model built on the whole data set.

Side note: so far, I've seen $R^2$ used for goodness of fit only, i.e. calculated explicitly on the training data. Doing this in a cross validation does yield information, but possibly not the information you're after.
You could construct a predictive $R^2$, but typically I've seen the residual sum of squares over out-of-training cases (e.g. $PRESS_{CV}$) reported instead of a % of unexplained variance for prediction.
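
For illustration, a sketch (same hypothetical `sim_data` as in the question's sketch) of $PRESS_{CV}$ and one way to turn it into a predictive $R^2$, sometimes written $Q^2 = 1 - PRESS_{CV}/TSS$:

```r
# Residuals collected only on held-out (out-of-training) cases.
folds <- sample(rep(1:5, length.out = nrow(sim_data)))
cv_resid <- unlist(lapply(1:5, function(f) {
  m <- lm(y ~ x1 + x2, data = sim_data[folds != f, ])
  sim_data$y[folds == f] - predict(m, newdata = sim_data[folds == f, ])
}))
press <- sum(cv_resid^2)                                     # PRESS_CV
q2    <- 1 - press / sum((sim_data$y - mean(sim_data$y))^2)  # a "predictive R^2"
```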


caret: I cannot say for sure - I don't use caret as I have hierarchical data structures and need to take some rather special care in splitting training and test data.

It could be that $n_{test}$ is set aside for nested validation, or that it is a parameter that is used if hold-out validation is done instead of cross validation. (On the page you linked, I did not see an explicit explanation at a quick glance.)

Look up the code, dude.


Jackknife vs. cross validation: no, there is no difference in the calculations; the same procedure just ended up with a second name.