Solved – Question about performing k-fold CV with caret

caret, cross-validation, model comparison

I have read the help manual of caret carefully: see A Short Introduction to the caret Package.

In its example, the data is split with createDataPartition before the model is trained.

library(caret)
library(mlbench)
data(Sonar)
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
str(inTrain)
training <- Sonar[inTrain,]
testing <- Sonar[-inTrain,]

Does it then run the repeated 10-fold CV using only the 75% of the data in the training set?

ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
plsFit <- train(Class ~ .,
                data = training,
                method = "pls",
                tuneLength = 15,
                trControl = ctrl,
                metric = "ROC",
                preProc = c("center", "scale"))
plsFit

plsClass <- predict(plsFit, newdata = testing)

So, I think k-fold CV puts "k-1" parts of the data into the training set and "1" part into the testing set, and every part should be used as the testing set at least once?

The code above uses only 75% of the data for the whole model training process and the 10-fold CV process, and the remaining 25% only for the final prediction step.
Do I understand this code correctly?

I think the process in the caret manual differs from the basic idea of k-fold CV, which really confuses me.

And I want to perform 10-fold CV for model comparison, including rpart, adaboost, bagging, svm and kknn. Can the functions in caret make sense?
How can I do it?

Best Answer

The first bit:

inTrain <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)

splits the totality of your data into training (75%) and test (25%).

Usually, cross-validation and other resampling methods are used on the training set. So if you use caret's train function for example and ask for 10-fold CV, the 10% held back is 10% of the training set.

So, I think k-fold CV puts "k-1" parts of the data into the training set and "1" part into the testing set,

Sort of (the terminology is getting in the way). During each iteration of 10-fold CV, 90% of the data is used to fit the model and 10% is held out for prediction. Again, these are percentages of your training set, not of the full data set.
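To make the percentages concrete, here is a minimal sketch that builds the 10 folds by hand with caret's createFolds. It assumes the same Sonar split as in the question; the name heldOut is only illustrative, and train creates comparable folds internally when asked for 10-fold CV.

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(107)
inTrain  <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTrain, ]

# Held-out row indices for each of the 10 folds, drawn from the training set only.
heldOut <- createFolds(training$Class, k = 10, list = TRUE, returnTrain = FALSE)

# Each fold holds out roughly 10% of the training rows ...
sapply(heldOut, length) / nrow(training)

# ... and every training row is held out exactly once across the 10 folds.
all(sort(unlist(heldOut, use.names = FALSE)) == seq_len(nrow(training)))
```

The 25% in testing never enters these folds; it is touched only by the final predict call.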

Can the functions in caret make sense?

You will have to clarify this part. train can fit many different models.
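If the goal is to compare several models under the same 10-fold CV, one possible approach is to fix the folds once and pass them to every train call, then collect the results with resamples. This is a sketch under assumptions: it uses only "rpart" and "knn" because they need few extra packages, and the object names are illustrative. Other method strings (for example an SVM or kknn) should fit the same pattern but require their own packages to be installed.

```r
library(caret)
library(mlbench)
data(Sonar)

set.seed(107)
inTrain  <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
training <- Sonar[inTrain, ]

# Fix the 10 folds once so every model is resampled on identical splits.
set.seed(107)
folds <- createFolds(training$Class, k = 10, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

rpartFit <- train(Class ~ ., data = training, method = "rpart",
                  trControl = ctrl)
knnFit   <- train(Class ~ ., data = training, method = "knn",
                  trControl = ctrl,
                  preProc = c("center", "scale"))

# Collect the per-fold performance of both models for a side-by-side comparison.
cmp <- resamples(list(rpart = rpartFit, knn = knnFit))
summary(cmp)
```

Because both fits share the same index, their per-fold results are paired, which is what makes the comparison meaningful.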

Related Solutions

Solved – Custom resampling method in caret

So, if you had 8 training set samples would this scheme result in choose(8,2) = 28 resamples? Also, I'm assuming that this isn't two nested leave-one-out loops.

If so, here is a solution that might break down with large sample sizes.

library(caret)

num_samps <- 8
## each column of holdout is one pair of held-out samples
holdout <- combn(num_samps, 2)
## each column of in_training holds the remaining samples used for fitting
in_training <- apply(holdout, 2, 
                     function(x, all) all[!(all %in% x)],
                     all = 1:num_samps)

## one list element per resample, holding the training row indices:
index <- lapply(seq_len(ncol(in_training)), function(i) in_training[, i])
## cosmetic:
names(index) <- caret:::prettySeq(seq(along = index))

ctrl <- trainControl(method = "cv", ## this will be ignored since  
                     ## we supply index below
                     index = index)

Max

Solved – Caret – Repeated K-fold cross-validation vs Nested K-fold cross validation, repeated n-times

There's nothing wrong with the (nested) algorithm presented, and in fact, it would likely perform well with decent robustness for the bias-variance problem on different data sets. You never said, however, that the reader should assume the features you were using are the most "optimal", so if that's unknown, there are some feature selection issues that must first be addressed.

FEATURE/PARAMETER SELECTION

A less biased approach is to never let the classifier/model come close to anything remotely related to feature/parameter selection, since you don't want the fox (classifier, model) to be the guard of the chickens (features, parameters). Your feature (parameter) selection method is a "wrapper", where feature selection is bundled inside the iterative learning performed by the classifier/model. By contrast, I always use a feature "filter" that employs a different method, far removed from the classifier/model, in an attempt to minimize feature (parameter) selection bias. Look up wrapping vs. filtering and selection bias during feature selection (G.J. McLachlan).

There is always a major feature selection problem, for which the solution is to invoke a method of object partitioning (folds), in which the objects are partitioned into different sets. For example, simulate a data matrix with 100 rows and 100 columns, and then simulate a binary variate (0,1) in another column -- call this the grouping variable. Next, run t-tests on each column using the binary (0,1) variable as the grouping variable. Several of the 100 t-tests will be significant by chance alone; however, as soon as you split the data matrix into two folds $\mathcal{D}_1$ and $\mathcal{D}_2$, each of which has $n=50$, the number of significant tests drops. Until you can solve this problem with your data by determining the optimal number of folds to use during parameter selection, your results may be suspect. So you'll need to establish some sort of bootstrap-bias method for evaluating predictive accuracy on the hold-out objects as a function of the sample size used in each training fold, e.g., $\pi=0.1n, 0.2n, 0.3n, 0.4n, 0.5n$ (that is, increasing sample sizes used during learning), combined with a varying number of CV folds, e.g., 2, 5, 10, etc.
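The simulation described above can be sketched in a few lines of base R. The seed, the 0.05 threshold, the fixed (rather than randomly drawn) grouping variable, the balanced fold assignment, and the helper name is_sig are all assumptions made for illustration. Since the data are pure noise, any significant feature is a chance finding, and the sketch also counts how many features are significant in both folds at once.

```r
set.seed(42)
n <- 100
p <- 100
X <- matrix(rnorm(n * p), nrow = n, ncol = p)    # pure-noise features
g <- rep(0:1, each = n / 2)                      # binary grouping variable
fold <- rep(rep(1:2, each = n / 4), times = 2)   # 25 rows of each group per fold

# TRUE for each column whose two-sample t-test is significant on the given rows.
is_sig <- function(rows, alpha = 0.05) {
  grp <- g[rows]
  apply(X[rows, , drop = FALSE], 2, function(x) {
    t.test(x[grp == 0], x[grp == 1])$p.value < alpha
  })
}

sig_all <- is_sig(rep(TRUE, n))   # all 100 rows
sig_1   <- is_sig(fold == 1)      # fold D_1, n = 50
sig_2   <- is_sig(fold == 2)      # fold D_2, n = 50

c(all = sum(sig_all), fold1 = sum(sig_1), fold2 = sum(sig_2),
  both_folds = sum(sig_1 & sig_2))
```

Requiring a feature to be significant in both folds is one simple way to see how rarely chance findings replicate across partitions.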

OPTIMIZATION/MINIMIZATION

You really seem to be solving an optimization or minimization problem for function approximation, e.g., $y=f(x_1, x_2, \ldots, x_j)$, where a regression or other predictive model with parameters is used and $y$ is continuously scaled. Given this, and given the need to minimize bias in your predictions (selection bias, bias-variance, information leakage from testing objects into training objects, etc.), you might look into employing CV together with swarm intelligence methods, such as particle swarm optimization (PSO), ant colony optimization, etc. PSO (see Kennedy & Eberhart, 1995) adds parameters for social and cultural information exchange among particles as they fly through the parameter space during learning. Once you become familiar with swarm intelligence methods, you'll see that you can overcome a lot of biases in parameter determination. Lastly, I don't know if there is a random forest (RF, see Breiman, Journ. of Machine Learning) approach for function approximation, but if there is, use of RF for function approximation would alleviate 95% of the issues you are facing.
