Classification – How to Choose the Right Number of CV Folds for Nested Cross-Validation with a Small Sample

classification, cross-validation, sample-size

I have a small, imbalanced sample of 18 and 12 records in the 0-labeled and 1-labeled groups respectively, with 6 features, so my feature matrix X is $30 \times 6$ and my target vector y is $30 \times 1$.

The data in CSV format is shared via the dropbox link:

THE PROBLEM

I would like to assess my classification accuracy, and to do so I perform stratified nested cross-validation: the outer loop iterates over different train-test splits, and the inner loop selects the best model. My sklearn (0.18.1) implementation:

from numpy import logspace
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, accuracy_score

sss_outer = StratifiedShuffleSplit(n_splits=10, test_size=0.333, random_state=None)
sss_inner = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=None)

parameters = {'clf__C': logspace(-4, 3, 150)}
pipe_logistic = Pipeline([('scl', StandardScaler()), ('clf', LogisticRegression(penalty='l1'))])
grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring=make_scorer(accuracy_score), cv=sss_inner)
cross_val_score(grid_search, X, y, cv=sss_outer, scoring='accuracy')

Also, to see how many instances end up in each partition, I wrote a simple helper function, nested_partitioning_calculator(), which counts the instances per class in each split.
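
The helper itself isn't shown; a minimal sketch that reproduces the output below (assuming X and y are NumPy arrays and that the counts are printed as [n_class_1, n_class_0], matching the description further down) could look like this:

import numpy as np

def nested_partitioning_calculator(X, y, inner_cv, outer_cv):
    """Print per-class counts [n_class_1, n_class_0] for the outer and inner splits."""
    X, y = np.asarray(X), np.asarray(y)

    def counts(labels):
        return [int(np.sum(labels == 1)), int(np.sum(labels == 0))]

    print('-' * 17, 'Outer Split: Train+Validate and Test', '-' * 17)
    for trainval_idx, test_idx in outer_cv.split(X, y):
        print(counts(y[trainval_idx]), counts(y[test_idx]))

    # the inner CV operates on one outer train+validate part (here: the last one)
    print('-' * 17, 'Inner Split: Train and Validate', '-' * 17)
    X_tv, y_tv = X[trainval_idx], y[trainval_idx]
    for train_idx, val_idx in inner_cv.split(X_tv, y_tv):
        print(counts(y_tv[train_idx]), counts(y_tv[val_idx]))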

nested_partitioning_calculator(X, y, sss_inner, sss_outer)
----------------- Outer Split: Train+Validate and Test -----------------
[8, 12] [4, 6]
[8, 12] [4, 6]
[8, 12] [4, 6]
[8, 12] [4, 6]
[8, 12] [4, 6]
[8, 12] [4, 6]
[8, 12] [4, 6]
[8, 12] [4, 6]
[8, 12] [4, 6]
[8, 12] [4, 6]
----------------- Inner Split: Train and Validate -----------------
[6, 8] [2, 4]
[6, 8] [2, 4]
[6, 8] [2, 4]
[6, 8] [2, 4]
[6, 8] [2, 4]
[6, 8] [2, 4]
[6, 8] [2, 4]
[6, 8] [2, 4]
[6, 8] [2, 4]
[6, 8] [2, 4]

From this example, it is clear that the outer CV loop splits the [12, 18] data (18 instances in class 0 and 12 in class 1) into [8, 12] for the train+validate sample and [4, 6] for the test sample, and then the inner loop splits the [8, 12] again into a training set of [6, 8] and a validation set of [2, 4]. So, basically, the training is performed on a small set of 6 instances from class 1 and 8 instances from class 0.

THE QUESTION

Obviously, different values of the test_size parameter will yield different CV accuracy results. What is the proper way to split a small data set in nested CV? Should I be aiming at smaller test and validation samples (i.e., moving towards LOOCV)?

Best Answer

@John is right that sampling variability is your problem, in particular the variance of the performance estimates.

In contrast to his advice, I'd strongly recommend not doing LOO. The main reason (apart from the possible complication of a strong pessimistic bias due to the inherent lack of stratification) is that with LOO you cannot distinguish two different sources of variance:

  • variance due to the limited number of cases tested and
  • variance due to model instability (i.e. due to the training sample size being so limited that exchanging a few training cases does make a difference). Model instability is one symptom of unsuccessful optimization.

Doing e.g. repeated k-fold cross-validation (or out-of-bootstrap, ...), you can separate these influences, because you can check whether the predictions for the same case by different surrogate models agree or not (= model instability). The more aggressively you optimize in the inner loop, the more important it is to make sure the optimization yields stable results (across the surrogate models of the outer loop).
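
For illustration (my own sketch, not code from the answer), such a stability check can be run by collecting, for each case, the labels predicted by the different surrogate models and counting how often they disagree. It assumes X and y are NumPy arrays and uses RepeatedStratifiedKFold (available in sklearn versions newer than the 0.18.1 in the question) with a plain logistic-regression pipeline as a placeholder model:

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# one surrogate model per fold x repetition; each case is tested once per repetition
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())

preds = [[] for _ in range(len(y))]
for train_idx, test_idx in rskf.split(X, y):
    fitted = model.fit(X[train_idx], y[train_idx])
    for i, label in zip(test_idx, fitted.predict(X[test_idx])):
        preds[i].append(label)

# fraction of cases that receive different labels from different surrogate models
unstable = np.mean([len(set(p)) > 1 for p in preds])
print('unstable predictions for {:.0%} of the cases'.format(unstable))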

Now one consequence of your limited number of cases is that the performance estimates will have high variance due to the low number of test cases. If you work with a 0/1 loss such as accuracy*, you can do some back-of-the-envelope calculations of what uncertainty to expect:

  • the outer loop has 30 cases. At the end, all of those have been tested. The best possible case is that all of them were correctly predicted, and even then a binomial 95% confidence interval for 30 correct out of 30 cases yields roughly 90 - 100% accuracy.

  • say you do 6-fold CV in the outer loop (which you can do nicely stratified for your application). Then the optimization has 25 cases, and correspondingly wider confidence intervals for its performance estimates (the sketch after this list works through these numbers).
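
To reproduce these back-of-the-envelope numbers (my own sketch, not part of the original answer), an exact Clopper-Pearson interval can be computed from the beta distribution; the exact bounds depend on the interval method, so they only roughly match the figures quoted above:

from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (Clopper-Pearson) confidence interval for k successes in n trials."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

print(clopper_pearson(30, 30))  # perfect result on 30 outer-loop cases: ~(0.88, 1.00)
                                # (the one-sided 95% lower bound, 0.05**(1/30), is ~0.90)
print(clopper_pearson(25, 25))  # perfect result on 25 inner-loop cases: even wider
print(clopper_pearson(21, 30))  # a more realistic 70% observed accuracy: ~(0.51, 0.85)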

Without going into calculations, I think it is unlikely that the expected differences between the models compared in the optimization step are large enough to be reliably measured with only 25 or 30 cases available.

Thus I recommend considering not doing any optimization at all, but restricting yourself to a model whose hyperparameters you can fix from external knowledge (if such a model exists). We wrote a paper on a closely related topic that may be of interest:
Beleites, C., Neugebauer, U., Bocklitz, T., Krafft, C. and Popp, J.: Sample size planning for classification models. Anal. Chim. Acta, 2013, 760, 25-33.
DOI: 10.1016/j.aca.2012.11.007

Accepted manuscript on arXiv: 1211.1323

* there are other figures of merit, e.g. proper scoring rules, that are much better behaved from a statistical point of view. Nevertheless, they usually don't work miracles either.
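
For instance, sklearn's 'neg_log_loss' scorer (log loss is a strictly proper scoring rule) can replace accuracy in the outer loop of the question's code; this is just an illustration, assuming your sklearn version provides that scorer name:

# score the outer loop with log loss instead of accuracy;
# 'neg_log_loss' uses predict_proba, which the logistic-regression pipeline provides
cross_val_score(grid_search, X, y, cv=sss_outer, scoring='neg_log_loss')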


update: Plausibility check whether doing an optimization is worthwhile:

  • take an unoptimized model that doesn't need hyperparameters, or one calculated with manually set plausible hyperparameters (e.g. logistic regression without regularization, or a random forest with manually set hyperparameters), and cross-validate it (total = 30 tested cases).
    Let's assume you get 21 correct = 70% accuracy.
  • check, e.g. by simulated McNemar's tests, how much better the optimized model would need to be for its superiority to be recognizable (see the sketch after this list).
    In the example, McNemar's test would be significant if the optimized model reached 90% accuracy in the paired comparison without making any error that the reference model didn't make, or if it made one new error but reached an accuracy above 93%.
    It is then up to you to judge how realistic it is to expect such an improvement from the optimization and whether it is worth trying.

  • similarly, you can check with a simulated proportion test what performance you'd need to observe in order for it to be significantly better than, say, random guessing of the class label.
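
A minimal sketch of both checks (my own illustration; the 21/30 baseline accuracy, alpha = 0.05, and the 60% majority-class guessing rate are assumptions): an exact McNemar test on the discordant pairs, and a one-sided binomial test against the guessing baseline.

from scipy.stats import binom

def mcnemar_exact_p(n01, n10):
    """Exact two-sided McNemar test on the discordant pairs (n01, n10)."""
    n = n01 + n10
    return min(1.0, 2 * binom.cdf(min(n01, n10), n, 0.5)) if n > 0 else 1.0

n_total, baseline_correct = 30, 21            # reference model: 70% accuracy
baseline_errors = n_total - baseline_correct  # 9 errors that could potentially be fixed

# How many of the baseline's errors must the optimized model fix before the paired
# difference is significant at alpha = 0.05, with 0 or 1 new errors of its own?
for new_errors in (0, 1):
    for fixed in range(baseline_errors + 1):
        if mcnemar_exact_p(new_errors, fixed) < 0.05:
            acc = (baseline_correct - new_errors + fixed) / n_total
            print('{} new error(s): fix {} of {} -> {:.0%} accuracy needed'
                  .format(new_errors, fixed, baseline_errors, acc))
            break

# Smallest number of correct predictions (out of 30) that is significantly better
# than always guessing the majority class (18/30 = 60% correct, assumed baseline):
for k in range(18, n_total + 1):
    if binom.sf(k - 1, n_total, 0.6) < 0.05:  # one-sided P(X >= k) under guessing
        print('need at least {}/30 correct = {:.0%} accuracy'.format(k, k / n_total))
        break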
