I'm having a few issues understanding cross-validation and how to apply it to produce predictive models.
I am currently working on neural networks (in R), and I'm using the caret package to perform k-fold cross-validation (k = 10) on the training portion of my dataset (70% of the entire dataset).
I have three questions:
1) Can I use cross-validation to determine the best hyperparameters of a model?
2) Once the hyperparameters are selected using the ROC value, would I then need to test this final model against a test set (my remaining 30% of unseen data), or is this the whole point of cross-validation, in that I do not need a separate test set?
3) If I do need to test the final model, how exactly do I do this? For example, if cross-validation determines that a neural network with size = 10 and decay = 0.01 has the highest ROC value, would I refit this model on all the training data and then test it on my test set to determine its true predictive power on unseen data?
Best Answer
The approach I learned (from this Coursera course) is to divide your dataset into three subsets: training, testing, and validation. A 60-20-20 split between them is common. The two central rules are (a) never use the test set to directly fit a model, and (b) only touch the validation set once, at the very end of the process, after the final model has been chosen.
The training set is used to fit models. This is done using cross-validation: `caret::train` is designed to conduct steps 1 and 2. Use the `metric` argument to identify the target statistic (and `maximize` to tell the CV algorithm whether to maximize or minimize it). Use `tuneGrid` and `tuneLength` to set the range of hyperparameter values to search over. `trControl` lets you fine-tune the process of using CV to select the optimal model. I don't think ROC is one of the pre-defined options for `metric`, so you'll need to pass the `trControl` argument a `trainControl` object, with the `summaryFunction` argument of `trainControl` set to a ROC function.

The test set is used iteratively with the training set to refine the model.
`caret` has three prediction functions to use with this step: `extractPrediction`, `extractProb`, and `predict`. Functions such as `plotObsVsPred` and `confusionMatrix` can be used to identify problem cases.

This iterative process can introduce overfitting, which can bias your estimates of how your model performs on totally new data. The test set isn't used directly to fit the model, but the iterative process means that you are selecting the model that best fits the test set. The validation set helps avoid that problem.
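For the prediction step, a minimal sketch (assuming a fitted `caret` model `fit` and a held-out data frame `test_df` with the same `Class` outcome as the training data):

```r
library(caret)

# Predicted classes on the held-out data
preds <- predict(fit, newdata = test_df)

# Compare predictions against the observed classes
confusionMatrix(preds, test_df$Class)

# Class probabilities, e.g. for building ROC curves
probs <- predict(fit, newdata = test_df, type = "prob")
```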
AFAIK `caret` doesn't have specific functions for this final step, but the same functions used with step 3 are useful here.
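Tying this back to question 3: once CV has selected the hyperparameters, you can refit on the training data with a single-row `tuneGrid` (so `train` does no further searching) and then score the untouched holdout once. A sketch, with hypothetical data frame names:

```r
library(caret)

# Fix the winning hyperparameters; a one-row grid means no search
final_grid <- data.frame(size = 10, decay = 0.01)

final_fit <- train(Class ~ ., data = train_df,
                   method = "nnet", tuneGrid = final_grid,
                   trControl = trainControl(method = "none",
                                            classProbs = TRUE),
                   trace = FALSE)

# One-time evaluation on the unseen holdout
confusionMatrix(predict(final_fit, newdata = holdout_df),
                holdout_df$Class)
```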