Cross-Validation – Why Use Stratified Cross Validation and Its Effect on Variance

cross-validation, resampling, stratification

I've been told that it is beneficial to use stratified cross-validation, especially when response classes are unbalanced. If one purpose of cross-validation is to help account for the randomness of our original training sample, then forcing each fold to have the same class distribution seems to work against this, unless you were sure your original training set had a representative class distribution.

Is my logic flawed?

EDIT
I'm interested in whether this method damages the benefit of CV. I can see why stratification is necessary if you have a small sample, very unbalanced classes, or both, to avoid ending up with no representative of the minority class in some fold.

The paper Apples-to-Apples in Cross-Validation Studies: Pitfalls in Classifier Performance Measurement puts the case for stratification well, but all the arguments seem to amount to "stratification provides a safeguard and more consistency", yet no safeguard would be required given enough data.

Is the answer simply "We use it out of necessity as we rarely have enough data."?

Best Answer

Bootstrapping seeks to simulate the effect of drawing a new sample from the population, and it does not try to ensure distinct test sets (the test set is the out-of-bag residue left over after N draws from N with replacement).
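A minimal sketch of a single bootstrap replicate on hypothetical toy data (the dataset and variable names here are illustrative, not from any particular library):

```python
import random

random.seed(0)
data = list(range(10))          # toy dataset of N = 10 items
n = len(data)

# One bootstrap replicate: N draws with replacement.
boot = [random.choice(data) for _ in range(n)]

# The out-of-bag residue (items never drawn) serves as the test set;
# on average about 36.8% of items end up out-of-bag.
drawn = set(boot)
oob = [x for x in data if x not in drawn]

print(len(boot), len(oob))
```

Note that nothing here guarantees the out-of-bag sets of different replicates are distinct, which is exactly the contrast with K-fold CV.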

R×K-fold cross-validation ensures K distinct test folds within each run, then repeats the process R times over different random partitionings; the independence assumptions that hold for a single K-fold CV are lost across repetitions.
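The R×K scheme can be sketched in a few lines of pure Python (`rxk_folds` is a hypothetical helper, assuming indices 0..n-1 and a fresh shuffle per repetition):

```python
import random

def rxk_folds(n, k, r, seed=0):
    """Yield (train, test) index lists: K disjoint test folds per run,
    repeated R times over different random partitionings."""
    rng = random.Random(seed)
    idx = list(range(n))
    for _ in range(r):
        rng.shuffle(idx)              # a fresh partitioning each repetition
        for f in range(k):
            test = idx[f::k]          # every k-th index -> K disjoint folds
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield train, test

splits = list(rxk_folds(n=12, k=3, r=2))
print(len(splits))  # 3 folds x 2 repetitions = 6 train/test pairs
```

Within one repetition the test folds partition the data exactly; across repetitions the same item appears in several test folds, which is where the independence is lost.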

Stratified cross-validation violates the principle that the test labels should never be looked at before the statistics are calculated. This is generally thought to be innocuous, since the only effect is to balance the folds, but it does lead to a loss of diversity (an unwanted loss of variance), and it moves even further from the Bootstrap idea of constructing a sample similar to one you would draw naturally from the whole population. Arguably the main reason stratification is important is to address defects in the classification algorithms, which are too easily biased by over- or under-representation of classes. An algorithm that uses balancing techniques (either by selection or by weighting) or that optimizes a chance-corrected measure (Kappa, or preferably Informedness) is less affected by this, although even such an algorithm can't learn or test a class that isn't there.
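The balancing effect is easy to see in code. This is a minimal sketch of stratified fold assignment (`stratified_folds` is a hypothetical helper, not a library function), using the labels to distribute each class round-robin across folds:

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each index to one of k folds, distributing each class
    round-robin so every fold mirrors the overall class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    return folds

# Toy 90/10 imbalanced labels: plain random folds can easily miss the
# minority class, but here every fold gets exactly one minority instance.
labels = [0] * 45 + [1] * 5
folds = stratified_folds(labels, k=5)
print([sum(labels[i] for i in fold) for fold in folds])  # [1, 1, 1, 1, 1]
```

Note that the assignment had to inspect `labels`, which is precisely the "looking at the test labels" objection above.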

Forcing each fold to have at least m instances of each class, for some small m, is an alternative to stratification that works for both bootstrapping and CV. It does introduce a smoothing bias, making folds tend to be more balanced than they would otherwise be expected to be.
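A sketch of this forced-minimum alternative (`forced_min_folds` is a hypothetical helper): each fold is seeded with m instances of every class, and the remainder is assigned at random rather than in class proportions, preserving more of the natural variation than full stratification.

```python
import random

def forced_min_folds(labels, k, m, seed=0):
    """Assign indices to k folds, guaranteeing each fold gets at least m
    instances of every class; the remainder falls where it may."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    leftover = []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for f in range(k):                    # seed each fold with m instances
            folds[f].extend(idxs[f * m:(f + 1) * m])
        leftover.extend(idxs[k * m:])
    rng.shuffle(leftover)                     # the rest is assigned at random
    for pos, i in enumerate(leftover):
        folds[pos % k].append(i)
    return folds

labels = [0] * 40 + [1] * 10
folds = forced_min_folds(labels, k=5, m=1)
# Every fold is guaranteed >= 1 minority instance, but counts may vary.
print(min(sum(labels[i] for i in f) for f in folds))
```

Unlike the stratified version, fold class counts above the guaranteed minimum are left to chance, which is the "smoothing bias without full rigidity" trade-off.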

Re ensembles and diversity: if the classifiers learned on the training folds are used for fusion, not just for estimating generalization error, then the increasing rigidity of CV, stratified Bootstrap and stratified CV leads to a loss of diversity, and potentially of resilience, compared with Bootstrap, forced Bootstrap and forced CV.