Solved – Question about performing k-fold CV with caret

caret, cross-validation, model comparison

I have read the help manual of caret carefully: see A Short Introduction to the caret Package.

In its example, I found that it splits the data with createDataPartition before training a model.

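library(caret)
library(mlbench)
data(Sonar)
set.seed(107)
# Stratified split on Class: 75% of the rows go to the training set
inTrain <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
str(inTrain)
training <- Sonar[inTrain,]
# The remaining 25% of the rows form the test set
testing <- Sonar[-inTrain,]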

Does it then run the repeated 10-fold CV using only the 75% of the data in the training set?

ctrl <- trainControl(method = "repeatedcv", repeats = 3,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
plsFit <- train(Class ~ ., data = training, method = "pls",
                tuneLength = 15, trControl = ctrl, metric = "ROC",
                preProc = c("center", "scale"))
plsFit

plsClass <- predict(plsFit, newdata = testing)

So, I think k-fold CV puts "k-1" parts of the data into the training set and "1" part into the testing set, and every part should be used as the testing set at least once?

The code above uses only 75% of the data for the whole model training process, including the 10-fold CV, and uses the remaining 25% only in the final prediction step.
Do I understand this code correctly?

I think the process in the caret manual differs from the basic idea of k-fold CV, which really confuses me.

I also want to perform 10-fold CV for model comparison, including rpart, adaboost, bagging, svm and kknn. Can the functions in caret make sense?
How can I do it?

Best Answer

The first bit:

inTrain <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)

splits the totality of your data into training (75%) and test (25%).

Usually, cross-validation and other resampling methods are applied to the training set only. So if you use caret's train function, for example, and ask for 10-fold CV, the 10% held back in each iteration is 10% of the training set, not of the full data.

So, I think k-fold CV puts "k-1" parts of the data into the training set and "1" part into the testing set,

Sort of (the terminology is getting in the way). During each iteration of 10-fold CV, 90% of the training set is used to fit the model and the other 10% is held out for prediction. Again, these are percentages of your training set, so the 25% test set is never touched by the resampling.
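To make this concrete, here is a minimal sketch that builds the 10 folds by hand with caret's createFolds, using the same split as in the question. The names folds and heldOut are only illustrative, and train does its own fold construction internally, so this is an approximation of the idea rather than a trace of what train runs.

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(107)
inTrain  <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTrain, ]
testing  <- Sonar[-inTrain, ]

# Split only the training rows into 10 held-out folds.
# The indices returned refer to rows of `training`, not of Sonar.
set.seed(107)
folds <- createFolds(training$Class, k = 10, list = TRUE, returnTrain = FALSE)

# Every training row is held out exactly once across the 10 folds.
heldOut <- unname(unlist(folds))
all(sort(heldOut) == seq_len(nrow(training)))

# Share of the training set held out in each fold (roughly 10% each).
sapply(folds, length) / nrow(training)
```

The rows in testing never appear in any fold; they are only used afterwards, for example in predict(plsFit, newdata = testing).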

Can the functions in caret make sense?

You will have to clarify this part. train can fit many different models.
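If the goal is to compare several models under the same 10-fold CV, one possible approach is sketched below. It assumes that fixing the folds through the index argument of trainControl and collecting the results with resamples is what is wanted. Only rpart and knn are fitted here to keep the dependencies small; other method strings such as "treebag", "svmRadial" or "kknn" should slot in the same way, but each needs its own extra package installed.

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(107)
inTrain  <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTrain, ]

# Build the 10 training-fold index sets once so that every model
# is resampled on exactly the same folds.
set.seed(107)
cvIndex <- createFolds(training$Class, k = 10, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = cvIndex,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

rpartFit <- train(Class ~ ., data = training, method = "rpart",
                  trControl = ctrl, metric = "ROC")
knnFit   <- train(Class ~ ., data = training, method = "knn",
                  preProc = c("center", "scale"),
                  trControl = ctrl, metric = "ROC")

# Collect the per-fold results of both models for a side-by-side summary.
cmp <- resamples(list(rpart = rpartFit, knn = knnFit))
summary(cmp)
```

Because both fits share the same folds, the per-fold ROC values in cmp are paired, which makes the comparison between models fairer than comparing runs on different random splits.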
