Solved – Cross validation and predicting models in R

cross-validation

I'm having a few issues understanding cross-validation and how to apply it to produce predictive models.

I am currently working on neural networks (in R) and I'm using the caret package to perform k-fold cross-validation (k = 10) on the training portion of my dataset (70% of the entire dataset).
I have three questions:

1) Can I use cross-validation to determine the best hyperparameters of a model?

2) Once the hyperparameters are selected using the ROC value, would I then need to test this final model against a test set (my remaining 30% of unseen data), or is the whole point of cross-validation that I don't need a separate test set?

3) If I do need to test the final model, how exactly do I do this? For example, if the cross-validation determines that a neural network with size = 10 and decay = 0.01 has the highest ROC value, would I refit this model on all of the training data and then test it on my test set to determine its true predictive power on unseen data?

Best Answer

The approach that I learned (from this Coursera course) is that you divide your dataset into 3 subsets: training, testing, and validation. I think a 60-20-20 ratio between the datasets is common. The two central rules are (a) never use the test set to directly fit a model, and (b) only touch the validation set once, at the very end of the process and after the final model has been chosen.
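To make the split concrete, here is a minimal sketch using caret's createDataPartition; `dat` and its two-class outcome column `Class` are placeholders for your own data, and the 60-20-20 proportions are just the ones mentioned above:

```r
library(caret)

set.seed(123)

# Carve off 60% of the rows for training (stratified on the outcome)
in_train  <- createDataPartition(dat$Class, p = 0.6, list = FALSE)
training  <- dat[in_train, ]
remainder <- dat[-in_train, ]

# Split the remaining 40% in half: 20% testing, 20% validation overall
in_test    <- createDataPartition(remainder$Class, p = 0.5, list = FALSE)
testing    <- remainder[in_test, ]
validation <- remainder[-in_test, ]
```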

The training set is used to fit models. This is done using cross-validation:

  1. Select a range of hyperparameter values and a target statistic.
  2. Fit the model using CV at each hyperparameter value, optimizing the target statistic. Select the model (hyperparameter value) that optimizes the target statistic.

caret::train is designed to conduct steps 1 and 2. Use the metric argument to identify the target statistic (and maximize to tell the CV algorithm whether to maximize or minimize it). Use tuneGrid and tuneLength to set the range of hyperparameter values to search over. trControl lets you fine-tune the process of using CV to select the optimal model. I don't think ROC is one of the pre-defined options for metric, so you'll need to pass the trControl argument a trainControl object whose summaryFunction argument is set to a function that computes ROC.
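For a two-class problem, steps 1 and 2 might look like the sketch below. It assumes the `training` data frame from the split above, with a factor outcome `Class`; the grid values are only illustrative, and twoClassSummary is caret's built-in summary function that reports ROC:

```r
library(caret)

# 10-fold CV; classProbs = TRUE is needed so ROC can be computed
ctrl <- trainControl(method = "cv",
                     number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Illustrative grid of nnet hyperparameters to search over
grid <- expand.grid(size = c(1, 5, 10), decay = c(0, 0.01, 0.1))

set.seed(123)
nn_fit <- train(Class ~ .,
                data = training,
                method = "nnet",
                metric = "ROC",      # the target statistic to optimize
                tuneGrid = grid,
                trControl = ctrl,
                trace = FALSE)       # passed on to nnet to silence its output

nn_fit$bestTune   # the size and decay selected by CV
```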

The test set is used iteratively with the training set to refine the model.

  1. Use the selected model from step 2 to generate predictions on the test set. If the target statistic comes out too low, repeat steps 1 and 2 to fine-tune your model. If the target statistic is sufficiently high, you have your final model.

caret has three prediction functions to use with this step: extractPrediction, extractProb, and predict. Functions such as plotObsVsPred and confusionMatrix can be used to identify problem cases.
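Continuing the same sketch, evaluating the selected model on the test set could look like this (again assuming the `testing` data frame and `Class` outcome from above):

```r
# Class predictions and class probabilities on the held-out test set
test_pred  <- predict(nn_fit, newdata = testing)
test_probs <- predict(nn_fit, newdata = testing, type = "prob")

# Compare predictions with the observed classes
confusionMatrix(test_pred, testing$Class)
```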

This iterative process can introduce overfitting, which can bias your estimates of how your model performs on totally new data. The test set isn't used directly to fit the model, but the iterative process means that you are selecting the model that best fits the test set. The validation set helps avoid that problem.

  1. Only do this step once, at the very end of your analysis. Use the final model from step 3 to generate predictions on the validation set. Report the results as your out-of-sample accuracy/error estimates.

AFAIK caret doesn't have specific functions for this final step. But the same functions used with step 3 are useful here.
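As a rough sketch under the same assumptions, the final check might be nothing more than:

```r
# Only run this once, after the final model has been chosen
val_pred <- predict(nn_fit, newdata = validation)
cm <- confusionMatrix(val_pred, validation$Class)
cm$overall["Accuracy"]   # report this as the out-of-sample accuracy estimate
```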
