Machine Learning – Should RepeatedKFold Use Full Data or Only Train Data?

classification, cross-validation, machine learning, mathematical-statistics, random forest

I am working on a binary classification problem using a random forest, with a dataset of 977 records and 6 columns. The class ratio is 77:23 (an imbalanced dataset).

Since my dataset is small, I learnt that it is not advisable to use a regular 70/30 train_test split.

So I was thinking of using repeated k-fold CV. Please find my code below.

Approach 1 – Full data – X, y

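from numpy import mean
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rf_boruta = RandomForestClassifier(class_weight='balanced', max_depth=3, max_features='sqrt', n_estimators=300)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
scores = cross_val_score(rf_boruta, X, y, scoring='f1', cv=cv)
print('mean f1: %.3f' % mean(scores))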

But I see that the full input data X is passed to the model at once. Doesn't this lead to data leakage? For example, if I am doing categorical encoding, the encoding would be based on all the categories encountered in the full dataset. Similarly, suppose the dataset spans the years 2017 to 2022. The model could then be trained on 2021 data in one of the folds and validated on 2020 data.

So, is it right to use RepeatedKFold as shown below?

Approach 2 – only train data – X_train, y_train

# X_train, y_train: the training portion from an earlier train/test split
rf_boruta = RandomForestClassifier(class_weight='balanced', max_depth=3, max_features='sqrt', n_estimators=300)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
scores = cross_val_score(rf_boruta, X_train, y_train, scoring='f1', cv=cv)
print('mean f1: %.3f' % mean(scores))

I ask because the tutorials that I see online (here and here) all pass the full dataset (X, y) to RepeatedKFold.

Can someone help me understand which approach is best?

Best Answer

K-fold cross-validation is used here to estimate performance, not to tune hyperparameters. Therefore, you should use the full dataset, and doing so does not cause data leakage. In every split, the model is fit on the training folds only and tested on the fold that was held out. Repeated k-fold simply repeats this procedure multiple times, with a different random partition each time. Note that k-fold CV is different from time-series CV: it assumes no temporal dependency between the samples.
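To make the held-out property concrete, here is a minimal sketch that inspects the row indices produced by the splitters rather than training a model. The data is synthetic and only mimics the shape described in the question (977 rows, 6 columns, roughly 23% positives). `n_repeats=3` is an assumption chosen for speed, and the time-series part assumes the rows are already sorted by time.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, TimeSeriesSplit

# Synthetic stand-in for the question's data (an assumption, not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(977, 6))
y = (rng.random(977) < 0.23).astype(int)

# Repeated stratified k-fold: each split tests on rows that were excluded from training.
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
splits = list(rskf.split(X, y))
disjoint = all(
    np.intersect1d(train_idx, test_idx).size == 0
    for train_idx, test_idx in splits
)
n_splits_total = rskf.get_n_splits(X, y)  # n_splits * n_repeats

# Shuffled folds ignore row order, so a split can train on later rows
# and test on earlier ones (the 2021-vs-2020 situation from the question).
mixes_order = any(
    train_idx.max() > test_idx.min() for train_idx, test_idx in splits
)

# Time-series CV: training rows always come before the test rows.
tscv = TimeSeriesSplit(n_splits=5)
ordered = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in tscv.split(X)
)

print(f"total splits: {n_splits_total}, train/test disjoint: {disjoint}")
print(f"repeated k-fold mixes row order: {mixes_order}")
print(f"time-series split keeps train before test: {ordered}")
```

If the records really do have a temporal order that matters, something like `TimeSeriesSplit` could be passed as `cv` to `cross_val_score` in place of `RepeatedStratifiedKFold`. Note that it is not stratified, so class proportions may vary between folds.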
