Now, after repeating the steps 10 times, I will have 10 different optimized models.
Yes.
Cross validation (like other resampling-based validation methods) implicitly assumes that these models are at least equivalent in their predictions, so you are allowed to average/pool all those test results.
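As an illustration, here is a minimal pure-Python sketch of pooling the fold-wise test results of a 10-fold cross validation. The data and the trivial majority-class "model" are purely hypothetical stand-ins for your SVM:

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle indices and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_class(labels):
    """Trivial stand-in for SVM training: always predict the majority label."""
    return max(set(labels), key=labels.count)

y = [0] * 60 + [1] * 40          # hypothetical labels for 100 cases
fold_errors = []
for test_idx in kfold_indices(len(y), 10):
    test_set = set(test_idx)
    train_y = [y[i] for i in range(len(y)) if i not in test_set]
    pred = majority_class(train_y)            # one surrogate model per fold
    errors = sum(1 for i in test_idx if y[i] != pred)
    fold_errors.append(errors / len(test_idx))

# Pooling the 10 test results assumes the surrogate models are equivalent
cv_error = sum(fold_errors) / len(fold_errors)
print(round(cv_error, 2))
```

Each of the 10 surrogate models sees 90 of the 100 cases; the pooled error is just the average of the 10 fold-wise error rates.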
Usually there is a second, stronger assumption: that those 10 "surrogate models" are equivalent to the model built on all 100 cases:
To predict for an unknown dataset (200 points), should I use the model that gave me the minimum error, OR should I do step 2 once again on the full data (run grid.py on the full data) and use that as the model for prediction of unknowns?
Usually the latter is done (second assumption).
However, personally I would not do a grid optimization on the whole data again (though one can argue about that) but instead use the cost and γ parameters that turned out to be a good choice in the 10 optimizations you did already (see below).
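For example, one could take the hyperparameters the 10 grid optimizations agree on, either as the most frequent (cost, γ) pair or as component-wise medians. The parameter values below are made up purely for illustration:

```python
from collections import Counter
from statistics import median

# Hypothetical (cost, gamma) winners from the 10 grid optimizations
best_params = [(8, 0.5), (8, 0.5), (8, 0.25), (16, 0.5), (8, 0.5),
               (8, 0.5), (8, 0.25), (8, 0.5), (16, 0.5), (8, 0.5)]

# Option 1: the most frequent (cost, gamma) pair across surrogate models
best_pair, votes = Counter(best_params).most_common(1)[0]

# Option 2: component-wise medians (useful when values scatter on the grid)
median_cost = median(c for c, g in best_params)
median_gamma = median(g for c, g in best_params)

print(best_pair, votes, median_cost, median_gamma)
```

If the winners scatter widely instead of clustering like this, that is already a sign the optimization is unstable (see the stability check below).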
However, there are also so-called aggregated models (e.g. random forest aggregates decision trees), where all 10 models are used to obtain 10 predictions for each new sample, and then an aggregated prediction (e.g. majority vote for classification, average for regression) is used. Note that you validate those models by iterating the whole cross validation procedure with new random splits.
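A minimal sketch of such majority-vote aggregation, with hypothetical predictions from 10 surrogate models for 3 new samples:

```python
from collections import Counter

# Hypothetical class predictions of the 10 surrogate models for 3 new samples
predictions = [
    ["a", "a", "b"],   # model 1
    ["a", "b", "b"],   # model 2
    ["a", "a", "b"],
    ["b", "a", "b"],
    ["a", "a", "a"],
    ["a", "a", "b"],
    ["a", "b", "b"],
    ["a", "a", "b"],
    ["a", "a", "b"],
    ["b", "a", "b"],   # model 10
]

def majority_vote(votes):
    """Aggregate a classification by taking the most frequent label."""
    return Counter(votes).most_common(1)[0][0]

# One aggregated prediction per new sample (vote over the model column)
aggregated = [majority_vote(column) for column in zip(*predictions)]
print(aggregated)
```

For regression you would replace the vote by the mean of the 10 predicted values.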
Here's a link to a recent question about what such iterations are good for: Variance estimates in k-fold cross-validation
Also I would like to know: is the procedure the same for other machine-learning methods (like ANN, Random Forest, etc.)?
Yes, it can be applied very generally.
As you optimize each of the surrogate models, I recommend looking a bit more closely at those results:
are the optimal cost and γ parameters stable (= equal or similar for all models)?
The difference between the error reported by the grid optimization and the test error you observe for the 10% unknown data is also important: if the difference is large, the models are likely to be overfit - particularly if the optimization reports very small error rates.
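A back-of-the-envelope check of this gap could look like the following; all error rates here are invented for illustration:

```python
# Hypothetical error rates of the 10 surrogate models: what the grid
# optimization reported (inner) vs. the held-out 10 % test data (outer)
grid_errors = [0.02, 0.03, 0.02, 0.04, 0.03, 0.02, 0.03, 0.02, 0.04, 0.03]
test_errors = [0.12, 0.15, 0.10, 0.18, 0.14, 0.11, 0.13, 0.16, 0.12, 0.14]

gaps = [t - g for g, t in zip(grid_errors, test_errors)]
mean_gap = sum(gaps) / len(gaps)

# A large gap together with very small reported errors is a warning
# sign that the optimization has overfit.
print(round(mean_gap, 3))
```

In this invented example the optimization reports ~3 % error but the held-out data shows ~13 %, a gap that should make you suspicious of the optimized models.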
I think you can make such an estimate: since the different models use the same dataset, their accuracies can be compared. One important caveat, however, is that you may need to take the models' parameters into account and consider whether those parameters influence your conclusion.
Best Answer
You surely found the very similar question: Choice of K in K-fold cross-validation?
(Including the link to Ron Kohavi's work)
If your sample size is already small, I recommend avoiding any data-driven optimization. Instead, restrict yourself to models whose hyperparameters you can fix from your knowledge of the model and the application/data. This makes one of the validation/test levels unnecessary, leaving more of your few cases for training the surrogate models in the remaining cross validation.
IMHO, with that sample size you cannot afford very fancy models anyway. And you almost certainly cannot afford any meaningful model comparisons (certainly not unless you use proper scoring rules and paired analysis techniques).
This decision is far more important than the precise choice of $k$ (say, 5-fold vs. 10-fold) - with the important exception that leave-one-out is not recommended in general.
Interestingly, for these very-small-sample-size classification problems, validation is often more difficult (in terms of sample size needs) than training a decent model. If you need literature on this, see e.g. our paper on sample size planning:
Beleites, C. and Neugebauer, U. and Bocklitz, T. and Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33.
DOI: 10.1016/j.aca.2012.11.007
accepted manuscript on arXiv: 1211.1323
Another important point is to make good use of the possibility to iterate/repeat the cross validation (which is one of the reasons against LOO): this allows you to measure the stability of the predictions against perturbations (i.e. exchanging a few cases) of the training data.
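A rough sketch of such iterated cross validation, measuring how often a sample's prediction stays the same across repetitions. The data and the trivial threshold "classifier" are hypothetical stand-ins for a real model, chosen only so the predictions actually depend on the random split:

```python
import random

def kfold_indices(n, k, seed):
    """Shuffle indices with the given seed and split them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def threshold_classifier(train_y):
    # Trivial stand-in for the real model: predict class 1 if it is
    # sufficiently frequent in the training fold (0.4 is arbitrary).
    return 1 if sum(train_y) / len(train_y) > 0.4 else 0

y = [0] * 60 + [1] * 40        # hypothetical labels
n_reps, k = 5, 10
per_sample = [[] for _ in y]   # one prediction per sample and repetition

for rep in range(n_reps):      # each repetition uses a new random split
    for test_idx in kfold_indices(len(y), k, seed=rep):
        test_set = set(test_idx)
        train_y = [y[i] for i in range(len(y)) if i not in test_set]
        pred = threshold_classifier(train_y)
        for i in test_idx:
            per_sample[i].append(pred)

# Stability: fraction of samples whose prediction never changes
stable = sum(1 for p in per_sample if len(set(p)) == 1) / len(y)
print(stable)
```

If many samples flip their prediction between repetitions, the models are unstable with respect to small changes in the training data, and a single cross validation run would give you a misleadingly precise error estimate.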
Literature:
DOI: 10.1007/s00216-007-1818-6
DOI: 10.1016/j.chemolab.2009.07.016
If you decide for a single run on a hold-out test set (no iterations/repetitions),