Solved – Splitting data into test/train set vs. using k-fold cross validation

classification, cross-validation, machine-learning, r

I am working on a binary classification problem (in R) and I am having some confusion about when and how to use data splitting versus k-fold cross-validation (CV). I have about 50 labeled samples and I want to train various algorithms (SVM, KNN, NB) to make predictions on new data.

My question is: do I need to both split my data into train and test sets and perform k-fold CV? It seems to me that if you just perform k-fold CV without a split, you are training on the test data. However, in my research on this topic I find people saying that you can use one or the other, or both. How could you use just k-fold CV without a data split?

Here is example code for my three approaches; could someone please explain which is the appropriate choice? I think I may have a fundamental misunderstanding of how cross-validation works.

Approach 1:

# load the packages (caret's method="nb" fits naive Bayes via klaR)
library(caret)
library(klaR)

# load the iris dataset
data(iris)

# define training control
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)

# evaluate the model
fit <- train(Species~., data=iris, trControl=trainControl, method="nb")

# display the results
print(fit)

Approach 2:

# load the packages
library(caret)
library(klaR)

# load the iris dataset
data(iris)

# define an 80%/20% train/test split of the dataset
trainIndex <- createDataPartition(iris$Species, p=0.80, list=FALSE)
dataTrain <- iris[ trainIndex,]
dataTest <- iris[-trainIndex,]

# train a naive Bayes model
fit <- NaiveBayes(Species~., data=dataTrain)

# make predictions
predictions <- predict(fit, dataTest[,1:4])

# summarize results
confusionMatrix(predictions$class, dataTest$Species)

Approach 3:

# load the packages
library(caret)
library(klaR)

# load the iris dataset
data(iris)

# define an 80%/20% train/test split of the dataset
trainIndex <- createDataPartition(iris$Species, p=0.80, list=FALSE)
dataTrain <- iris[ trainIndex,]
dataTest <- iris[-trainIndex,]

# define training control
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)

# evaluate the model
fit <- train(Species~., data=dataTrain, trControl=trainControl, method="nb")

# make predictions
predictions <- predict(fit, dataTest[,1:4])

# summarize results (predict() on a caret train object returns a factor, not a list with $class)
confusionMatrix(predictions, dataTest$Species)

Best Answer

One usually:

  1. Splits data into train and test sets.
  2. Stashes the test set until the very, very last moment.
  3. Trains and tunes models with k-fold CV or bootstrapping (also a very useful tool).
  4. Once all the models are tuned and showing good results, takes out the stashed test set and observes the real state of things.

The idea is to keep the test set aside so that your models never train on it; that way they cannot "remember" the samples (read: overfit) and show you better results than they would really achieve on new data.
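
To make that workflow concrete, here is a minimal caret sketch along the lines of your Approach 3: the repeated k-fold CV happens only on the training set, and the stashed test set is touched exactly once at the very end. The seed and the variable name ctrl are just illustrative.

# load the packages
library(caret)
library(klaR)

# load the iris dataset
data(iris)

# make the split reproducible
set.seed(123)

# 1. split the data and stash the 20% test set
trainIndex <- createDataPartition(iris$Species, p=0.80, list=FALSE)
dataTrain <- iris[ trainIndex,]
dataTest <- iris[-trainIndex,]

# 2.-3. tune the model with repeated 10-fold CV on the training set only
ctrl <- trainControl(method="repeatedcv", number=10, repeats=3)
fit <- train(Species~., data=dataTrain, trControl=ctrl, method="nb")

# 4. only now evaluate on the stashed test set, exactly once
predictions <- predict(fit, dataTest[,1:4])  # predict() on a train object returns a factor
confusionMatrix(predictions, dataTest$Species)

If you prefer the bootstrap resampling mentioned in step 3, trainControl(method="boot", number=25) is a drop-in replacement for the repeated-CV control above.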