Solved – Comparing classification algorithms using cross validation and caret’s train

algorithms, caret, classification, cross-validation, model comparison

I am having trouble understanding some concepts of algorithm comparison, parameter optimization, and cross-validation in R.

Let's say I want to compare two classification algorithms, such as random forests and kNN. I have my data, and first I want to tune each algorithm with the train function from R's caret package to find good parameter values (for example, mtry for random forests and k for kNN).

After that, I want to use those best parameters and estimate my algorithms' accuracies in order to compare them.

Am I supposed to:

A) Train using the full dataset, then perform k-fold cross-validation by fitting k models (using the same folds for both algorithms) and averaging the k accuracies?

B) Split first into training and test data (say 75/25), tune on the training data, fit one model per algorithm on the training data, and then check its accuracy on the test data?

C) Something else, which is probably the correct approach?

I know I am probably mixing things up and that there is probably a correct way of doing this. What is it?

Best Answer

What you need is called "nested cross-validation". These questions and answers on Cross Validated (CV) deal with it:

Use of nested cross-validation

Nested cross validation for model selection

How to split the dataset for cross validation, learning curve, and final evaluation?

and others. Just search for "nested cross validation".
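
For illustration, here is a minimal sketch of what nested cross-validation could look like with caret. It is not taken from the linked answers. The iris data, the seed, the tuning grids, and the choice of 5 folds at both levels are all assumptions made for the example. The inner loop (train with a "cv" trainControl) picks mtry or k using only the outer training part; the outer loop scores the tuned model on data that played no role in tuning. Both algorithms share the same outer folds, so their fold accuracies can be compared pairwise.

```r
library(caret)  # method = "rf" also needs the randomForest package installed

set.seed(42)           # assumed seed, only for reproducibility
x <- iris[, 1:4]       # example data (assumption); substitute your own predictors
y <- iris$Species      # example class labels (assumption)

# Outer folds: row indices of each outer TRAINING part, shared by both algorithms.
outer_train_idx <- createFolds(y, k = 5, returnTrain = TRUE)

# Inner loop: train() tunes by 5-fold CV inside the outer training part only.
inner_ctrl <- trainControl(method = "cv", number = 5)

# Candidate tuning values (assumed grids for illustration).
grids <- list(
  rf  = expand.grid(mtry = 1:4),
  knn = expand.grid(k = c(3, 5, 7, 9))
)

# For each algorithm and each outer fold: tune on the training part,
# then measure accuracy on the held-out part.
outer_acc <- sapply(names(grids), function(method) {
  sapply(outer_train_idx, function(train_idx) {
    fit <- train(x = x[train_idx, ], y = y[train_idx],
                 method = method,
                 tuneGrid = grids[[method]],
                 trControl = inner_ctrl)
    pred <- predict(fit, newdata = x[-train_idx, ])
    mean(pred == y[-train_idx])
  })
})

outer_acc            # one row per outer fold, one column per algorithm
colMeans(outer_acc)  # nested-CV accuracy estimate for each algorithm
```

The selected mtry or k may differ from one outer fold to the next. That is expected: the outer estimate describes the whole "tune, then fit" procedure rather than one fixed parameter value. To get a final model, run train once more on the full dataset.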

Related Solutions

Solved – Do we have to fix splits before 10-folds cross validation if we want to compare different algorithms

Solved – Caret – Repeated K-fold cross-validation vs Nested K-fold cross validation, repeated n-times
