Solved – Comparing classification algorithms using cross validation and caret’s train

algorithms, caret, classification, cross-validation, model comparison

I am having trouble understanding how algorithm comparison, parameter optimization, and cross-validation fit together in R.

Let's say I want to compare two classification algorithms, such as random forests and kNN. I have my data, and first I want to tune each algorithm with the train function from R's caret package to find good parameter values (mtry for the random forest and k for kNN, for example).
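
For concreteness, this is roughly what I mean by the tuning step. A minimal sketch with caret, where the iris data and the tuning grids are just placeholders for my actual data and search ranges:

    library(caret)

    ## Placeholder data -- substitute your own predictors and outcome
    data(iris)

    ctrl <- trainControl(method = "cv", number = 10)

    ## Tune mtry for the random forest
    set.seed(1)
    rf_fit <- train(Species ~ ., data = iris,
                    method    = "rf",
                    tuneGrid  = expand.grid(mtry = 1:4),
                    trControl = ctrl)

    ## Tune k for kNN
    set.seed(1)
    knn_fit <- train(Species ~ ., data = iris,
                     method    = "knn",
                     tuneGrid  = expand.grid(k = seq(3, 15, by = 2)),
                     trControl = ctrl)

    rf_fit$bestTune   # the selected mtry
    knn_fit$bestTune  # the selected k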

After that, I want to use those best parameters and estimate my algorithms' accuracies in order to compare them.

Am I supposed to:

A) Train on the full dataset, then perform k-fold cross-validation by building k models (using the same folds for both algorithms) and averaging the k accuracies?

B) First split the data into training and test sets (say 75/25), tune on the training data, fit one model per algorithm on the training data, and then check each model's accuracy on the test data? (A sketch of this option follows the list.)

C) Something else, which is probably the correct approach?
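
For example, option B would look roughly like this (again with iris standing in for my data):

    library(caret)

    data(iris)  # placeholder data
    set.seed(1)

    in_train <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
    training <- iris[in_train, ]
    testing  <- iris[-in_train, ]

    ctrl <- trainControl(method = "cv", number = 10)

    ## Tune each algorithm on the training data only
    rf_fit  <- train(Species ~ ., data = training, method = "rf",  trControl = ctrl)
    knn_fit <- train(Species ~ ., data = training, method = "knn", trControl = ctrl)

    ## Accuracy of the tuned models on the held-out 25%
    confusionMatrix(predict(rf_fit,  testing), testing$Species)$overall["Accuracy"]
    confusionMatrix(predict(knn_fit, testing), testing$Species)$overall["Accuracy"]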

I know I am probably mixing things up, and there is presumably a correct way of doing this. What is it?

Best Answer

What you need to do is called "nested cross validation". These questions and answers on Cross Validated deal with it:

Use of nested cross-validation

Nested cross validation for model selection

How to split the dataset for cross validation, learning curve, and final evaluation?

and others; just search for "nested cross validation".
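
For concreteness, here is one minimal way to sketch nested cross-validation with caret. The iris data, the fold counts, the tuning grids, and the helper name nested_cv_accuracy are only placeholders; the point is that train runs the inner (tuning) loop, while the outer folds are held out purely for the accuracy estimates you compare:

    library(caret)

    data(iris)  # placeholder data
    set.seed(1)

    ## Outer loop: 5 folds whose held-out parts are used only for the final estimates
    outer_folds <- createFolds(iris$Species, k = 5, returnTrain = TRUE)

    ## Inner loop: train() tunes the parameters by 5-fold CV inside each outer-training set
    inner_ctrl <- trainControl(method = "cv", number = 5)

    nested_cv_accuracy <- function(method, tune_grid) {
      sapply(outer_folds, function(train_idx) {
        fit <- train(Species ~ ., data = iris[train_idx, ],
                     method = method, tuneGrid = tune_grid, trControl = inner_ctrl)
        held_out <- iris[-train_idx, ]
        mean(predict(fit, held_out) == held_out$Species)
      })
    }

    rf_acc  <- nested_cv_accuracy("rf",  expand.grid(mtry = 1:4))
    knn_acc <- nested_cv_accuracy("knn", expand.grid(k = seq(3, 15, by = 2)))

    mean(rf_acc)   # estimated accuracy of the "tune a random forest" procedure
    mean(knn_acc)  # estimated accuracy of the "tune a kNN" procedure

Note that the numbers you compare estimate the performance of each whole tuning procedure, not of one fixed model; once you have picked the winner, refit it with train on all of your data to get the final model.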