Solved – Does it make sense to do Cross Validation with a Small Sample

cross-validation, sample-size, small-sample

I have a dataset with 16 samples and 250 predictors, and I've been asked to perform cross-validation (CV) on it. In the examples I've looked at, you split the data into training and testing subsets. The sample size seems too small to split into even smaller subsets. My question is: does CV make sense with a small sample?

Best Answer

I have concerns about involving 250 predictors when you have 16 samples. However, let's set that aside for now and focus on cross-validation.

You don't have much data, so any split into training and validation sets will leave very few observations to train on. However, there is a technique called leave-one-out cross-validation (LOOCV) that might work for you. You have 16 observations: train on 15 and validate on the one held out, then repeat until each of the 16 samples has been left out exactly once. The software you use should have a function for this. For instance, Python's sklearn package has utilities for LOOCV. Here is the example from the sklearn documentation.

# https://scikit-learn.org/stable/modules/generated/
# sklearn.model_selection.LeaveOneOut.html
#
>>> import numpy as np
>>> from sklearn.model_selection import LeaveOneOut
>>> X = np.array([[1, 2], [3, 4]])
>>> y = np.array([1, 2])
>>> loo = LeaveOneOut()
>>> loo.get_n_splits(X)
2
>>> print(loo)
LeaveOneOut()
>>> for train_index, test_index in loo.split(X):
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
...    print(X_train, X_test, y_train, y_test)
TRAIN: [1] TEST: [0]
[[3 4]] [[1 2]] [2] [1]
TRAIN: [0] TEST: [1]
[[1 2]] [[3 4]] [1] [2]
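To make this concrete at your scale, here is a minimal sketch of LOOCV on data shaped like yours (16 samples, 250 predictors). The data is synthetic and the model (ridge regression) is just a stand-in that tolerates having more predictors than samples; substitute your own data and estimator.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 250))  # synthetic: 16 samples, 250 predictors
y = rng.normal(size=16)         # synthetic response

# One fit per sample: train on 15 observations, score the held-out one.
scores = cross_val_score(
    Ridge(alpha=1.0), X, y,
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
)
print(len(scores))    # 16 folds, one per held-out sample
print(scores.mean())  # average held-out (negated) squared error
```

With only 16 folds the resulting error estimate will be noisy, so treat it as a rough check rather than a precise measure of performance.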

Do you, by any chance, work in genetics?
