Solved – Intuitive explanation of stratified cross validation and nested cross validation

classification, cross-validation, hyperparameter, self-study

According to the approach outlined here, I should split the dataset into a training set and an independent test set using a stratified split. This is the holdout way of splitting the dataset with a stratified approach.

Using the training set, I will do k-fold cross-validation for hyperparameter tuning and model selection. I have the following questions, for which I could not find answers in the document (I don't have access to the book by the author of the blog). I shall be grateful for help.

Question 1) Is this approach of a two-way split followed by model selection using k-fold cross-validation known as nested cross-validation?

Question 2) Consider an imbalanced dataset with 80 examples belonging to class 0 and 20 examples belonging to class 1. Can somebody please explain, with a simple example, what it means to ensure that each fold is representative of all strata of the data? What would the output of stratification be for this example? Is it the same as sampling with or without replacement?

Question 3) During model selection using k-fold cross-validation, should the k folds be obtained using stratified sampling again, even if the dataset has already been split into a training set and an independent test set using a stratified split?

Best Answer

1) You have correctly described nested cross-validation. The idea is to use an internal cross-validation loop to tune the model's hyperparameters before evaluating its performance. You could also do this with an outer CV loop, rather than just a single train/test split.
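Here is a minimal sketch of that pattern in Python, assuming scikit-learn; the synthetic dataset, the SVC, and the parameter grid are arbitrary placeholders for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Toy imbalanced dataset (hypothetical stand-in for your data)
X, y = make_classification(n_samples=100, weights=[0.8, 0.2], random_state=0)

# Inner loop: hyperparameter tuning via stratified k-fold CV
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
tuned_model = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: estimates the performance of the whole tuning procedure,
# not just of one fitted model
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(tuned_model, X, y, cv=outer_cv)
print("Nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Replacing the outer loop with a single stratified train/test split gives the two-way variant you described in the question.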

2) For a dataset of 100 samples, 80 of class 0 and 20 of class 1, a 10-fold stratified cross-validation should have 8 samples of class 0 and 2 samples of class 1 in each fold. This ensures that the training and testing data in each fold are truly representative of your full population. Without stratification, it would be possible to get a 5%/95% or 50%/50% split in your training data, which would not match the population you are sampling from. As for your replacement question: the folds are drawn without replacement, since each sample appears in exactly one held-out fold; stratification only constrains the class proportions within each fold. Stratification arguably becomes less important as your sample size grows, since the law of large numbers indicates that your fold proportions will approach the population proportions. With a small sample size, however, it becomes more likely that your fold sampling could imbalance the training data in a way you don't intend.
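You can verify the fold composition directly; a small sketch, assuming scikit-learn and using your 80/20 example:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 100 labels: 80 of class 0, 20 of class 1 (as in your example)
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # features are irrelevant to how folds are drawn

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each held-out fold contains 8 class-0 and 2 class-1 samples
    print("fold %d class counts:" % i, np.bincount(y[test_idx]))
```

Every fold prints `[8 2]`, i.e. each fold preserves the 80/20 class ratio of the full dataset.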

3) Ideally, all your training data should be representative of the population, so yes, stratifying both internal and external CV folds is good practice.
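Concretely, that means stratifying both the holdout split and the inner tuning folds; a sketch of both together, again assuming scikit-learn with placeholder data and model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X, y = make_classification(n_samples=100, weights=[0.8, 0.2], random_state=0)

# External split: stratify the holdout so train and test keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Internal CV: stratify the folds used for hyperparameter tuning as well
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1, 10]}, cv=inner_cv)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```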