Solved – Cross validation with test data set

cross-validation

I am a bit confused about the application of cross-validation. If I have a large dataset, I split my data into training and test sets and perform validation on the test set. But if I have a small dataset, I would like to use cross-validation, in which case the validation is already performed within it.

What puzzles me is that many people split their data, train on the training set with cross-validation, and then validate on the test set. So they combine the two methods. Is this a proper way to do it? Since my dataset is quite small, may I use only cross-validation?

Best Answer

Let's look at three different approaches:

  1. In the simplest scenario, you would collect one dataset and train your model via cross-validation to create your best model. Then you would collect another, completely independent dataset and test your model on it. However, this scenario is not feasible for many researchers given time or cost constraints.

  2. If you have a sufficiently large dataset, you would take a split of your data and leave it to the side, completely untouched by the training. This simulates a completely independent dataset: even though the held-out samples come from the same data collection, the model training takes no information from them. You would then build your model on the remaining training samples and test it on the held-out samples.

  3. If you have a smaller dataset, you may not be able to afford to simply ignore a chunk of your data during model building. In that case you use k-fold cross-validation: validation is performed on every fold, and your validation metric is aggregated across the folds.
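To make approaches 2 and 3 concrete, here is a minimal index-level sketch in plain Python (a hypothetical dataset of 100 samples; the sizes and k are arbitrary choices, not prescriptions):

```python
import random

rng = random.Random(0)  # fixed seed so the split is reproducible
indices = list(range(100))
rng.shuffle(indices)

# Approach 2: hold out 20% as a test set, untouched during training.
test_idx = indices[:20]
train_idx = indices[20:]

# Approach 3: k-fold CV on the training samples -- every sample
# serves as validation in exactly one of the k folds.
k = 5
folds = [train_idx[i::k] for i in range(k)]
```

Because the folds partition the training indices, aggregating the per-fold validation results uses every training sample exactly once for validation.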

To answer your question more directly: yes, you can just do cross-validation on your full dataset. You can then use the predicted and actual classes to evaluate your model's performance by whatever metric you prefer (accuracy, AUC, etc.).
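As a toy illustration of that evaluation, here is a full CV loop in plain Python. The "model" is a hypothetical majority-class predictor and the labels are made up; the point is only how predicted and actual classes are collected across folds and scored:

```python
from collections import Counter

# Toy labels (hypothetical); features are omitted for brevity.
y = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

def majority_class(labels):
    # "Training" here just memorizes the most common training label.
    return Counter(labels).most_common(1)[0][0]

k = 5
folds = [list(range(i, len(y), k)) for i in range(k)]  # deterministic folds

actual, predicted = [], []
for val_idx in folds:
    train_labels = [y[i] for i in range(len(y)) if i not in val_idx]
    pred = majority_class(train_labels)
    for i in val_idx:
        actual.append(y[i])
        predicted.append(pred)

# One aggregated metric over all out-of-fold predictions.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
```

Any metric that takes predicted and actual classes (accuracy, AUC, F1, ...) can be computed the same way from the pooled out-of-fold predictions.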

That said, you probably still want to look into repeated cross-validation to evaluate the stability of your model. Some good answers regarding this are here on internal vs. external CV and here on the number of repeats.
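Repeated cross-validation simply reruns the entire k-fold procedure with different random shuffles and examines the spread of the metric across runs. A minimal sketch, again with a hypothetical toy dataset and majority-class "model":

```python
import random
from collections import Counter

y = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]  # toy labels (hypothetical)
k = 5

def cv_accuracy(y, k, seed):
    """Run one full k-fold CV with a given shuffle seed; return accuracy."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for val_idx in folds:
        train_labels = [y[i] for i in idx if i not in val_idx]
        pred = Counter(train_labels).most_common(1)[0][0]
        correct += sum(y[i] == pred for i in val_idx)
    return correct / len(y)

# Repeat the whole CV procedure with different shuffles.
scores = [cv_accuracy(y, k, seed) for seed in range(10)]
mean_score = sum(scores) / len(scores)
spread = max(scores) - min(scores)  # a large spread signals instability
```

If the scores vary a lot from repeat to repeat, a single CV estimate is unreliable and the mean over repeats is a better summary.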