Solved – Predicting with cross validation

cross-validation, naive bayes

I want to predict labels with naive Bayes and cross-validation and measure the test accuracy. I understand the principle of cross-validation, but not entirely how to apply it.

My question: do I have to train and test the model on the whole dataset, or do I still have to split it into training and test sets even though I use cross-validation?

E.g., either this

library(caret)

train_control <- trainControl(method = "cv", savePred = TRUE)
# Use the first 2000 samples to train
naiveModel <- train(as.factor(label) ~ variable1 + variable2 + variable3,
                    data[1:2000, ], trControl = train_control, method = "nb")
# Use the last 400 samples to test (2001:2400, so they do not overlap the training rows)
naivePrediction <- predict(naiveModel, data[2001:2400, ])
postResample(naivePrediction, as.factor(data[2001:2400, 4]))

or that

train_control <- trainControl(method = "cv", savePred = TRUE)
# Use the whole dataset to train
naiveModel <- train(as.factor(label) ~ variable1 + variable2 + variable3,
                    data, trControl = train_control, method = "nb")
# Use the whole dataset to test
naivePrediction <- predict(naiveModel, data)
postResample(naivePrediction, as.factor(data[, 4]))

Best Answer

You typically use cross-validation to avoid splitting your data set, which would reduce the number of observations available for training. The technique is therefore most useful when your data set is small relative to the number of variables in your model.
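
For intuition, here is a minimal sketch of what 10-fold cross-validation does under the hood. It assumes a data frame data with the label, variable1, variable2 and variable3 columns from the question, and it uses e1071's naiveBayes, which is not necessarily the exact model caret fits for method = "nb", so treat it as illustrative only:

library(caret)  # for createFolds()
library(e1071)  # for naiveBayes()

set.seed(42)
# Split the row indices into 10 disjoint held-out sets, roughly balanced by label
folds <- createFolds(data$label, k = 10)
fold_accuracy <- sapply(folds, function(test_idx) {
  # Train on the other 9 folds ...
  fit <- naiveBayes(as.factor(label) ~ variable1 + variable2 + variable3,
                    data = data[-test_idx, ])
  # ... and measure accuracy on the held-out fold
  pred <- predict(fit, data[test_idx, ])
  mean(pred == as.factor(data$label[test_idx]))
})
mean(fold_accuracy)  # cross-validated accuracy estimate

Every observation is used for training in 9 of the 10 fits and for testing exactly once, which is why no separate test split is strictly required.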

However, it should be noted that cross-validation does not provide as strong an argument as a true validation set would. If you have a large amount of data relative to the number of variables, reserving a validation set gives you more confidence that your model captures the feature of interest rather than the noise.

All of these are tools for understanding the generality and performance of your final model, i.e. how it will perform on unseen samples. When your data is limited, use cross-validation (your second code snippet). When your data set is large, cross-validate, but also reserve a true validation set to gain more confidence in your model (your first snippet).
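
One caveat on the second snippet: predicting on the same rows you trained on measures training accuracy rather than generalization. If what you want is the cross-validated accuracy itself, caret already stores it in the fitted train object, so (as a sketch, using the naiveModel from the question) you can read it off directly:

naiveModel$results     # accuracy/kappa averaged over the CV folds, per tuning setting
naiveModel$resample    # accuracy/kappa for each individual fold
head(naiveModel$pred)  # held-out predictions, kept because savePred = TRUE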
