Solved – k-fold cross-validation for large data sets

cross-validation

I am performing 5-fold cross-validation on a relatively large data set, and I have noticed that the validation errors for the 5 folds are very similar. So I guess that, in this case, cross-validation is not very useful (it would give about the same answer as a single train/test split). I was wondering whether I am working with a special case, or whether this is true for all large data sets. I'm thinking that if you have enough training examples, the average cross-validation score would not be very different from the score on a single train/test split. Is this intuition correct? A quick check of this is sketched below.
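
As a rough way to check this, here is a minimal sketch that compares the spread of 5-fold cross-validation scores with a single hold-out score. The synthetic data, logistic regression model, and split sizes are placeholders, not the setup from the question.

```python
# Minimal sketch: compare 5-fold CV scores with a single train/test split.
# The dataset and model below are stand-ins for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Stand-in for a "large" data set.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: one accuracy score per fold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(model, X, y, cv=cv)
print("fold scores:", np.round(fold_scores, 4))
print("mean = %.4f, std = %.4f" % (fold_scores.mean(), fold_scores.std()))

# Single hold-out split for comparison.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
single_score = model.fit(X_tr, y_tr).score(X_te, y_te)
print("single-split score: %.4f" % single_score)
```

If the per-fold standard deviation is tiny relative to the mean, the single-split score tells you nearly the same thing as the full cross-validation.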

Best Answer

It certainly adds value compared with a single test set, because you get stronger evidence that your estimated accuracy is correct.

A large data set certainly helps in building robust, accurate models, though on its own it won't bias the cross-validation. The only potential problem you should check for is whether the set contains a significant fraction of duplicated objects -- this can happen when the number of attributes is very small compared with the number of objects.
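
A quick way to check for that is to count exact duplicate rows. This is a minimal sketch assuming the data sits in a CSV file; the file name `data.csv` is hypothetical.

```python
# Minimal sketch: estimate the fraction of exact duplicate rows.
import pandas as pd

df = pd.read_csv("data.csv")               # hypothetical input file
dup_fraction = df.duplicated().mean()      # boolean per row -> mean = fraction
print(f"fraction of duplicated rows: {dup_fraction:.4f}")
```

If that fraction is large, identical objects can end up in both the training and validation folds, which makes the fold scores look both better and more alike than they really are.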