I have a dataset with 16 samples and 250 predictors, and I'm being asked to perform CV on it. In the examples I've looked at, you create training and testing subsets. The sample size seems quite small to split into even smaller subsets. My question is: does CV make sense with a small sample?
Solved – Does it make sense to do Cross Validation with a Small Sample
cross-validation sample-size small-sample
Related Solutions
Let's look at three different approaches:
In the simplest scenario, you would collect one dataset and train your model via cross-validation to arrive at your best model. Then you would collect a second, completely independent dataset and test your model on it. However, this scenario is not possible for many researchers given time or cost limitations.
If you have a sufficiently large dataset, you would take a split of your data and set it aside, completely untouched during training. This simulates a completely independent dataset: although it comes from the same data collection, model training takes no information from those samples. You would then build your model on the remaining training samples and test it on the left-out samples.
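A held-out split like this is one line with sklearn. This is only a sketch on random placeholder data (`X`, `y` stand in for your own predictors and labels; the sizes are made up for illustration):

```python
# Hold out a test split that stays untouched during model building.
# Illustrative only: X, y are random placeholders, not real data.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # hypothetical: 100 samples, 5 predictors
y = rng.integers(0, 2, size=100)     # hypothetical binary labels

# stratify=y keeps the class balance similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)   # (75, 5) (25, 5)
```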
If you have a smaller dataset, you may not be able to afford to simply ignore a chunk of your data during model building. In that case, validation is performed on every fold (as in k-fold CV), and your validation metric is aggregated across the folds.
To answer your question more directly: yes, you can do cross-validation on your full dataset. You can then use your predicted and actual classes to evaluate your model's performance by whatever metric you prefer (accuracy, AUC, etc.).
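Collecting the out-of-fold predictions and scoring them against the true labels might look like this in sklearn. A sketch only: the data are random placeholders shaped like your 16 × 250 problem, and logistic regression is just an example classifier:

```python
# Cross-validate on the full dataset and score the out-of-fold predictions.
# X, y are random placeholders shaped like the 16 x 250 problem in the question.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 250))
y = np.array([0, 1] * 8)             # assume 8 cases per class

cv = StratifiedKFold(n_splits=4)     # 4 folds of 4 samples each
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(accuracy_score(y, y_pred))     # one aggregated out-of-fold accuracy
```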
That said, you probably still want to look into repeated cross-validation to evaluate the stability of your model. There are good answers elsewhere on this site on internal vs. external CV and on the number of repeats.
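Repeated CV just re-runs k-fold with different fold assignments, so the spread of the scores tells you how sensitive the estimate is to the split. A minimal sketch on the same kind of placeholder data:

```python
# Repeated stratified k-fold: repeat CV with different random fold assignments
# and look at the spread of scores as a rough stability check.
# X, y are random placeholders; LogisticRegression is an example classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(16, 250))
y = np.array([0, 1] * 8)

cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())   # large std across repeats = unstable
```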
@John is right that sampling variability is your problem. In particular, the variance on the performance estimates.
In contrast to his advice, I'd strongly recommend not doing LOO. The main reason (apart from the possible complication of a strong pessimistic bias due to the inherent lack of stratification) is that with LOO you cannot distinguish two different sources of variance:
- variance due to the limited number of cases tested and
- variance due to model instability (i.e. due to the training sample size being so limited that exchanging a few training cases does make a difference). Model instability is one symptom of unsuccessful optimization.
Doing e.g. repeated k-fold cross-validation (or out-of-bootstrap, ...), you can separate these influences, as you can check whether predictions for the same case by different surrogate models are the same or not (= model instability). The more aggressively you optimize in the inner loop, the more important it is to make sure the optimization yields stable results (across the surrogate models of the outer loop).
Now, one consequence of your limited number of cases is that the estimates of model performance will have high variance due to the low number of test cases. If you work with 0/1 loss, e.g. accuracy*, you can do some back-of-the-envelope calculations of what uncertainty to expect.
- The outer loop has 30 cases. At the end, all of those have been tested. The best possible case is that all were correctly predicted. A binomial 95% confidence interval for 30 correct out of 30 cases yields roughly 90 – 100% accuracy.
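That interval can be reproduced with a quick Clopper–Pearson calculation (a sketch using scipy; when all k = n cases are correct, the upper bound is 1 and the exact lower bound comes out a little under 90%):

```python
# Back-of-the-envelope Clopper-Pearson 95% interval for 30/30 correct.
# Lower bound via the beta quantile; the upper bound is 1 when k == n.
from scipy.stats import beta

k, n, alpha = 30, 30, 0.05
lower = beta.ppf(alpha / 2, k, n - k + 1)   # == (alpha/2)**(1/n) when k == n
print(round(lower, 3))                       # about 0.884, i.e. ~88-100%
```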
- Say you do 6-fold CV in the outer loop (which you can do nicely stratified for your application). Then the optimization has 25 cases per surrogate training set, and a correspondingly wider confidence interval for its performance estimates.
Without going into calculations, I think it unlikely that the expected differences between the models compared in the optimization step are large enough to measure reliably with only 25 or 30 cases available.
Thus I recommend considering not doing any optimization at all, and instead restricting yourself to a model whose hyperparameters (if any) you can fix by external knowledge.
We wrote a paper on a closely related topic that may be of interest:
Beleites, C. and Neugebauer, U. and Bocklitz, T. and Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33.
DOI: 10.1016/j.aca.2012.11.007
accepted manuscript on arXiv: 1211.1323
* There are other figures of merit, e.g. proper scoring rules, that are much better behaved from a statistical point of view. Nevertheless, they usually don't provide miracles either.
Update: plausibility check whether doing an optimization is worthwhile:
- Take an unoptimized model that doesn't need hyperparameters, or one calculated with manually set plausible hyperparameters (e.g. logistic regression without regularization, or random forest with manually set hyperparameters), and cross-validate it (total = 30 tested cases).
Let's assume you get 21 correct = 70% accuracy. Check, e.g. by simulated McNemar's tests, how much better the optimized model would need to be in order to recognize its superiority.
In the example, McNemar's test would be significant if the optimized model had 90% accuracy in the paired test without making any error that the reference model didn't make. Or it may make one new error at accuracy > 93%.
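The worked example can be checked with the exact (binomial) form of McNemar's test on the discordant pairs: with the reference model at 21/30 correct, an optimized model at 27/30 with no new errors gives 6 discordant pairs all in its favour, while one new error pushes the required accuracy to 28/30. A small sketch (the helper `mcnemar_exact` is written here for illustration, using only scipy's binomial CDF):

```python
# Two-sided exact McNemar p-value from the discordant-pair counts:
# b = cases the optimized model got right and the reference got wrong,
# c = cases the optimized model got wrong and the reference got right.
from scipy.stats import binom

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from discordant counts b, c."""
    n = b + c
    p = 2 * binom.cdf(min(b, c), n, 0.5)
    return min(p, 1.0)

print(mcnemar_exact(6, 0))  # 0.03125 -> 27/30, no new errors: significant
print(mcnemar_exact(7, 1))  # ~0.070  -> 27/30 with one new error: not enough
print(mcnemar_exact(8, 1))  # ~0.039  -> 28/30 (>93%) with one new error: significant
```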
It is then up to you to judge how realistic it is to expect such an improvement from the optimization and whether it is worth trying. Similarly, you can check with a proportion-test simulation what performance you'd need to observe in order to have performance significantly better than, say, random guessing of the class label.
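The comparison against random guessing is a one-sided exact binomial test. A sketch assuming 30 test cases and two balanced classes (so the guessing baseline is 50%):

```python
# How many of 30 cases must be correct to beat 50% random guessing?
# One-sided exact binomial test; 30 cases and balanced classes are assumptions.
from scipy.stats import binomtest

for correct in (18, 20, 21):
    p = binomtest(correct, n=30, p=0.5, alternative='greater').pvalue
    print(correct, round(p, 3))
```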
Best Answer
I have concerns about involving 250 predictors when you have 16 samples. However, let's set that aside for now and focus on cross-validation.
You don't have much data, so any split of the full set into training and validation sets is going to leave very few observations to train on. However, there is something called leave-one-out cross-validation (LOOCV) that might work for you. You have 16 observations: train on 15 and validate on the remaining one, and repeat until each of the 16 samples has been left out once. The software you use should have a function to do this for you. For instance, Python's sklearn package has utilities for LOOCV. I'll include some code in the spirit of the sklearn website.
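A minimal LOOCV sketch, assuming a classification task on random placeholder data shaped like yours (16 samples, 250 predictors, balanced classes; logistic regression stands in for whatever model you actually use):

```python
# LOOCV: 16 folds, each training on 15 samples and predicting the held-out one.
# X, y are random placeholders shaped like the question's data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(16, 250))       # 16 samples, 250 predictors
y = np.array([0, 1] * 8)             # assume 8 cases per class

loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print(len(scores), scores.mean())    # 16 folds; mean of the per-case 0/1 scores
```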
Do you, by any chance, work in genetics?