The syntax for cv.glm is clouding the issue here.
In general, one divides the data up into $k$ folds. The first fold is used as the test data, while the remaining $k-1$ folds are used to build the model. We evaluate the model's performance on the first fold and record it. This process is repeated until each fold is used once as test data and $k-1$ times as training data. There's no need to fit a model to the entire data set.
However, cv.glm is a bit of a special case. If you look at the documentation for cv.glm, you do need to fit an entire model first. Here's the example at the very end of the help text:
require('boot')
data(mammals, package="MASS")
mammals.glm <- glm(log(brain) ~ log(body), data = mammals)
(cv.err <- cv.glm(mammals, mammals.glm)$delta)
(cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta)
The fourth line performs leave-one-out cross-validation (each fold contains a single observation), while the last line performs 6-fold cross-validation.
This sounds problematic: using the same data for training and testing is a sure-fire way to bias your results, but it is actually okay. If you look at the source (in bootfuns.q, starting at line 811), the overall model is not used for prediction. The cross-validation code just extracts the formula and other fitting options from the model object and reuses those for the cross-validation, which is fine*, and the cross-validation itself is then done in the normal leave-a-fold-out sort of way.
It outputs a list, and the delta component contains two estimates of the cross-validated prediction error. The first is the raw prediction error (according to your cost function, or the average squared error if you didn't provide one); the second attempts to adjust for the bias introduced by not doing leave-one-out validation. The help text has a citation, if you care about why/how. These are the values I would report in my manuscript/thesis/email-to-the-boss and what I would use to build an ROC curve.
* I say fine, but it is annoying to fit an entire model just to initialize something. You might think you could do something clever like
my.model <- glm(log(brain) ~ log(body), data = mammals[1:5, ])
cv.err <- cv.glm(mammals, my.model)$delta
but it doesn't actually work, because it uses the $y$ values from the overall model instead of the data argument to cv.glm, which is silly. The entire function is less than fifty lines, so you could also just roll your own, I guess.
The approach that I learned (from this Coursera course) is that you divide your dataset into 3 subsets: training, testing, and validation. I think a 60-20-20 ratio between the datasets is common. The two central rules are (a) never use the test set to directly fit a model, and (b) only touch the validation set once, at the very end of the process and after the final model has been chosen.
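As a rough sketch, the split might be set up like this with caret (the data frame df and the two-class outcome column Class are hypothetical stand-ins for your own data):
library(caret)
set.seed(42)
in.train   <- createDataPartition(df$Class, p = 0.6, list = FALSE)
training   <- df[in.train, ]
remainder  <- df[-in.train, ]
in.test    <- createDataPartition(remainder$Class, p = 0.5, list = FALSE)
testing    <- remainder[in.test, ]     # 20% of the original data
validation <- remainder[-in.test, ]    # the final 20%; don't touch it until the very end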
The training set is used to fit models. This is done using cross-validation:
1. Select a range of hyperparameter values and a target statistic.
2. Fit the model using CV at each hyperparameter value, and select the model (hyperparameter value) that optimizes the target statistic.
caret::train is designed to conduct steps 1 and 2. Use the metric argument to identify the target statistic (and maximize to tell the CV algorithm whether to maximize or minimize it). Use tuneGrid and tuneLength to set the range of hyperparameter values to search over. trControl lets you fine-tune the process of using CV to select the optimal model. I don't think ROC is one of the pre-defined options for metric, so you'll need to pass the trControl argument a trainControl object whose summaryFunction argument is set to an ROC function.
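Putting that together, a sketch of steps 1 and 2 might look like the following (training and Class are assumed from the split above, glmnet is just an illustrative method, and twoClassSummary is caret's built-in ROC-based summary function):
library(caret)
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,              # class probabilities are needed to compute ROC
                     summaryFunction = twoClassSummary)
set.seed(42)
fit <- train(Class ~ ., data = training,
             method = "glmnet",                      # illustrative; any caret method works
             metric = "ROC", maximize = TRUE,        # the target statistic
             tuneLength = 10,                        # how many hyperparameter values to try
             trControl = ctrl)
fit$bestTune                                         # the hyperparameter values chosen by CV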
The test set is used iteratively with the training set to refine the model.
3. Use the selected model from step 2 to generate predictions on the test set. If the target statistic comes out too low, repeat steps 1 and 2 to fine-tune your model. If the target statistic is sufficiently high, you have your final model.
caret has three prediction functions to use with this step: extractPrediction, extractProb, and predict. Functions such as plotObsVsPred and confusionMatrix can be used to identify problem cases.
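For example (again assuming the fit and testing objects from the sketches above):
test.preds <- predict(fit, newdata = testing)                  # predicted classes
confusionMatrix(data = test.preds, reference = testing$Class)  # accuracy, sensitivity, etc.
test.probs <- predict(fit, newdata = testing, type = "prob")   # class probabilities, e.g. for an ROC curve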
This iterative process can introduce overfitting, which can bias your estimates of how your model performs on totally new data. The test set isn't used directly to fit the model, but the iterative process means that you are selecting the model that best fits the test set. The validation set helps avoid that problem.
4. Only do this step once, at the very end of your analysis. Use the final model from step 3 to generate predictions on the validation set. Report the results as your out-of-sample accuracy/error estimates.
AFAIK caret doesn't have specific functions for this final step, but the same functions used with step 3 are useful here.
The test set should be handled independently of the training set, so you could do a separate CV block for the test set if you really wish; it may provide some useful insight, but it is not universal practice. CV may be useful if you plan to apply the model to a completely new set of 'real world' data. Given that the test set is drawn from the same population as the training set, this may not be that useful, as you would expect it to have similar characteristics to the training set if the split was performed correctly and without bias. Mind you, it may be worth checking this assumption.
What is the purpose of CV?
This is not the purpose of CV; rather, it is to estimate the robustness of your performance metrics. As @user86895 states, it does not measure MSE (see Mean squared error versus Least squared error, which one to compare datasets? for further reading). CV creates multiple models on subsets of the data and applies each of them to the data withheld from that subset. It iterates over the dataset, building new models until every observation has been included in a training subset and in a test subset. The final model is built on the entire training set, not from any of the individual CV-round models; the purpose of CV is not to build models but to assess the stability of the model's performance, i.e. how generalisable the model is.
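To make that concrete, here is a small, hedged illustration with caret (an arbitrary rpart model on the built-in iris data): the CV-round fits survive only as performance numbers, while the object you keep is refit on all of the data.
library(caret)
set.seed(1)
cv.fit <- train(Species ~ ., data = iris, method = "rpart",
                trControl = trainControl(method = "cv", number = 5))
cv.fit$resample     # accuracy of each of the five CV-round models; the spread indicates stability
cv.fit$finalModel   # the single model that is kept, fit on the full data set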
When comparing different data-processing or analysis algorithms on a dataset, it provides a first filter to identify the workflows that produce the most stable models. It does this by estimating how variable the performance is between subsets of your training set. This allows you to detect models with a very high risk of overfitting and filter them out. Without cross-validation you would be picking based solely on the maximum performance, without regard to its stability. But when you come to apply a model in a deployed situation, its stability (relevance across the real-world population) will be more important than moderate differences in raw performance on a subset of curated samples (i.e. your original experimental set).
Cross-validation is in fact essential for choosing even the most basic parameters of a model, such as the number of components in PCA or PLS, using the Q2 statistic (which is R2 computed on the held-out data; see What is the Q² value for each component of a PCA) to determine when overfitting starts to degrade model performance.
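As a rough sketch of that with the pls package (the yarn data set ships with pls; the number of components and the CV settings are just defaults):
library(pls)
data(yarn)                                                      # NIR spectra predicting density
pls.fit <- plsr(density ~ NIR, ncomp = 10, data = yarn, validation = "CV")
R2(pls.fit)      # R2 on the held-out CV segments (the Q2-style statistic) for each number of components
RMSEP(pls.fit)   # the corresponding cross-validated prediction error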
I am taking this to mean 'how can I use CV results to estimate performance beyond my experimental set?', but will update this section of my answer if it is clarified differently.
CV is used as a first-line estimate of model stability, not to estimate performance in real-world settings. The only way to do that is to test the final model in a real-world situation. What CV does is provide a risk analysis: if the model appears stable, you could decide it is time to risk it on a real-world test. If it is not stable, then you probably need to expand your training set considerably (ensuring an even representation of important subgroups and confounding factors, since these are a source of overfitting beyond random noise; all relevant variation needs to be given equal exposure to the model-building process to be properly weighted) and build a new model.
And a note on real-world validation: if it works, it doesn't prove your model is generalisable, only that it works under the specific mechanisms whereby it has been deployed in the real world.