I am working on binary classification using a random forest, with a dataset of 977 records and 6 columns. The class ratio is 77:23, so the dataset is imbalanced.
Since my dataset is small, I learnt that it is not advisable to split it using a regular 70/30 train_test_split.
So, I was thinking of doing repeated k-fold CV instead. Please find my code below.
Approach 1 – Full data – X, y
from numpy import mean
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rf_boruta = RandomForestClassifier(class_weight='balanced', max_depth=3, max_features='sqrt', n_estimators=300)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
scores = cross_val_score(rf_boruta, X, y, scoring='f1', cv=cv)
print('mean f1: %.3f' % mean(scores))
But I see that the full input data X is passed to the model at once. Doesn't this lead to data leakage? For instance, if I do categorical encoding, it has to be based on all the categories encountered in the full dataset, including those in the validation folds. Similarly, if a dataset spans the years 2017 to 2022, the model could be trained on 2021 data in one of the folds and validated on 2020 data.
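For instance, here is a rough sketch of the pattern I am worried about (the categorical column 'city' is hypothetical, just to illustrate):

import pandas as pd

# Hypothetical example: encoding before CV. get_dummies sees every
# category in the full dataset, including categories that occur only
# in rows that later end up in a validation fold.
X_encoded = pd.get_dummies(X, columns=['city'])
scores = cross_val_score(rf_boruta, X_encoded, y, scoring='f1', cv=cv)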
So, is it right to instead run RepeatedKFold on only the training split, like below?
Approach 2 – Only train data – X_train, y_train
rf_boruta = RandomForestClassifier(class_weight='balanced', max_depth=3, max_features='sqrt', n_estimators=300)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
scores = cross_val_score(rf_boruta, X_train, y_train, scoring='f1', cv=cv)
print('mean f1: %.3f' % mean(scores))
I ask because the tutorials I see online, here and here, all use the full dataset (X, y) as input to RepeatedKFold.
Can you help me understand which approach is the best one to use?
Best Answer
K-fold CV here is used to estimate the model's performance, not to tune hyper-parameters, so you should use the full dataset (Approach 1). The splitting itself does not cause data leakage: within every split, the model is trained on the training folds and evaluated on a held-out test fold. The leakage you describe comes from preprocessing, not from the splitter; as long as steps such as categorical encoding are fitted inside each fold (for example by wrapping them in a Pipeline that you pass to cross_val_score) rather than on the full dataset beforehand, nothing from a test fold leaks into training. Repeating this procedure with different random splits is all that Repeated K-Fold adds. Note that k-fold CV is different from time-series CV and assumes no temporal dependency between the samples; if your samples are ordered in time, a splitter such as TimeSeriesSplit is the appropriate choice.
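Here is a minimal sketch of what leakage-free preprocessing looks like under that approach; the categorical column 'city' is a hypothetical stand-in for your own columns:

from numpy import mean
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Encode the hypothetical 'city' column; all other columns pass through.
preprocess = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])],
    remainder='passthrough',
)
model = Pipeline([
    ('prep', preprocess),
    ('rf', RandomForestClassifier(class_weight='balanced', max_depth=3,
                                  max_features='sqrt', n_estimators=300)),
])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=100)
# cross_val_score refits the whole pipeline on each training fold, so the
# encoder never sees the categories of the corresponding validation fold.
scores = cross_val_score(model, X, y, scoring='f1', cv=cv)
print('mean f1: %.3f' % mean(scores))

With the preprocessing inside the pipeline, Approach 1 gives you an honest performance estimate without needing a separate manual train/test split.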