Solved – When to use stratified k-fold cross-validation

According to a post on Analytics Vidhya:

Having said that, if the train set does not adequately represent the entire population, then using a stratified k-fold might not be the best idea. In such cases, one should use a simple k-fold cross validation with repetition.

I would like to get a better understanding of when one would choose stratified k-fold over simple k-fold when cross-validating. How would you test whether your training set is representative of your entire dataset?

Best Answer

The quote is quite generic. It usually applies to scenarios where we have class imbalance and "simple $k$-fold" CV might result in training subsamples that contain no instances (or insufficient instances) of the minority class. In these cases our learning procedure might indeed be presented with an unrepresentative subsample to train on. Note that even with unbalanced samples this becomes less of a problem as the overall sample size grows. There is no hard rule to follow here; we might simply want to ensure that both our training and test samples contain instances of every class, and that is not too hard to check. Choosing stratified CV over simple CV usually happens when we have an indication that our learning procedure is biased by class imbalance; a usual symptom is very low recall values, but that is just one facet of the problem.
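To see the difference concretely, here is a small sketch of my own (not part of the original answer); it assumes the caret package, whose createFolds() stratifies on the levels of a factor response:

# Sketch: count minority instances per fold under simple vs. stratified
# 5-fold splits, with a 1% minority class (4 instances out of N = 400).
library(caret)

set.seed(42)
y = factor(c(rep("majority", 396), rep("minority", 4)))
k = 5

# Simple k-fold: indices assigned to folds at random, ignoring class
simple_folds = split(sample(seq_along(y)), rep(1:k, length.out = length(y)))
sapply(simple_folds, function(idx) sum(y[idx] == "minority"))

# Stratified k-fold: caret samples within each class level separately,
# so the 4 minority instances end up spread across 4 different folds
strat_folds = createFolds(y, k = k)
sapply(strat_folds, function(idx) sum(y[idx] == "minority"))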

To give some concrete numbers on my point about sample size: let's assume that our minority class examples represent $1\%$ of our dataset and we use simple $5$-fold CV. Then approximately $(1-\frac{1}{5})^{\frac{1}{100}N}$ of our test subsamples will contain no instances of the minority class; i.e. even with $2100$ samples, less than $1\%$ of our test subsamples would have no minority class instances. Similarly, approximately $(1 - \frac{4}{5})^{\frac{1}{100}N}$ of our training subsamples will contain no instances of the minority class; i.e. even with "just" $300$ samples, less than $1\%$ of our training subsamples would have no minority class instances. Clearly, having just one representative of a given class is not very helpful, but it is evident that, especially when training our classifier, we quickly avoid totally unrepresentative samples as $N$ grows. I would recommend reading the CV.SE threads "Why use stratified cross validation? Why does this not damage variance related benefit?" and "When is unbalanced data really a problem in Machine Learning?"; both provide further context on the use of stratified $k$-fold CV and on imbalanced learning in particular.
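As a quick cross-check of these numbers (my own addition, using only base R): the exact probabilities are hypergeometric, so dhyper() computes them directly.

# Exact probability that a fold drawn without replacement contains zero
# minority instances: dhyper(0, m, N - m, fold_size), m = minority count.

# Test fold of size N/5, with N = 2100 and 1% minority (m = 21):
dhyper(0, 21, 2100 - 21, 2100 / 5)      # ~0.009, i.e. below 1%

# Training fold of size 4N/5, with N = 300 and 1% minority (m = 3):
dhyper(0, 3, 300 - 3, 4 * 300 / 5)      # ~0.008, i.e. below 1%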

I append a small R script that might help build intuition further.

M = 1000001  # Number of Monte Carlo trials (reps)
SS = 400     # Sample size
K = 5        # Number of folds
PMC = 0.01   # Proportion of minority class

# Trick: treat the top PMC fraction of indices (here 397-400) as the minority
# class; a fold then misses the minority class iff the largest index it
# sampled is at most SS * (1 - PMC).

# Prop. of test sets (size SS/K) with no minority class instances
sum( replicate(M,
     max( sample(SS, replace = FALSE, size = SS * (1/K)) ) <= SS * (1 - PMC) ) ) / M

# Prop. of training sets (size SS * (1 - 1/K)) with no minority class instances
sum( replicate(M,
     max( sample(SS, replace = FALSE, size = SS * (1 - 1/K)) ) <= SS * (1 - PMC) ) ) / M