Solved – Do we have to fix the splits before 10-fold cross-validation if we want to compare different algorithms

caret, cross-validation, e1071, r

I work with R. Let's say I have a training set and a test set, and I want to compare different algorithms (for example, neural networks and SVM).

First, I will perform 10-fold cross-validation on my training set to tune the neural network.

Then I will perform 10-fold cross-validation on my training set to tune the SVM.

Finally, I will compare the performance of each best model on my test set.

I was wondering whether it is theoretically an issue that the 10 folds (built at random) are not the same when tuning the two algorithms.

I thought this should not be a problem because the result of the tuning should be robust to the choice of folds, but apparently it is not (I have read as much about k-NN, with tune.knn from the e1071 package in R).

If we have to fix the splits before tuning, do you know how to do so in R? I haven't found the right option in the tune function of the e1071 package.

Is caret a better package in this respect? Since it seems possible to repeat 10-fold cross-validation when tuning, I think that might make the tuning results more robust and the comparison of different models more legitimate.

Thanks for your insight

Best Answer

The results will be sensitive to the splits, so you should compare models on the same partitioning of the data. Compare these two approaches:

  1. Approach 1 will compare two models, but use the same CV partitioning.
  2. Approach 2 will compare two models, but the first model will have a different CV partitioning than the second.

We'd like to select the best model. The problem with approach 2 is that the difference in performance between the two models will come from two different sources: (a) the differences between the two CV partitionings and (b) the differences between the algorithms themselves (say, random forest and logistic regression). If one model out-performs the other, we won't know if that difference in performance is entirely, partially, or not at all due to the differences in the two CV partitionings. On the other hand, any difference in performance using approach 1 cannot be due to differences in how the data were partitioned, because the partitions are identical.
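
If you use caret, approach 1 can be implemented by creating the folds once and passing them to trainControl(). The sketch below is only illustrative; my_train and its outcome column y are hypothetical names for your training data:

```r
library(caret)

set.seed(42)
# Ten training-index sets; returnTrain = TRUE because trainControl(index = )
# expects the rows used for fitting in each resample
folds <- createFolds(my_train$y, k = 10, returnTrain = TRUE)

ctrl <- trainControl(method = "cv", index = folds)

# Both models are tuned and evaluated on exactly the same ten partitions
nnet_fit <- train(y ~ ., data = my_train, method = "nnet",
                  trControl = ctrl, trace = FALSE)
svm_fit  <- train(y ~ ., data = my_train, method = "svmRadial",
                  trControl = ctrl)

# Resampling results are now directly comparable, fold by fold
summary(resamples(list(nnet = nnet_fit, svm = svm_fit)))
```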

To fix the partitioning, use cvTools to create your (repeated) CV partitions and store the results.
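
As a minimal sketch (n_obs, the number of training rows, is a placeholder name), cvFolds() from cvTools builds such a fixed, optionally repeated partitioning once, and you then reuse the same object when tuning every algorithm:

```r
library(cvTools)

set.seed(123)
# 10 folds, repeated 5 times; the same object is reused for all algorithms
cv_folds <- cvFolds(n = n_obs, K = 10, R = 5, type = "random")

# Row indices of the validation set for fold 1 of the first repetition
val_idx <- cv_folds$subsets[cv_folds$which == 1, 1]
```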