R – Using SMOTE to Balance Data Before or During Cross-Validation

cross-validation, r, smote, unbalanced-classes

I'm using Random Forest in the caret package to classify a binary outcome with roughly a 1/10 class ratio, so I need to balance the dataset.

I know two ways:

  1. Use SMOTE as a stand-alone function to balance the data first, and then pass the balanced data to training.

  2. Use sampling = "smote" inside caret's training, via trainControl().

As far as I understand, the first approach should be better, because it uses the whole data set to synthesize new samples (I know SMOTE only uses the 5 nearest neighbors by default, but it still has more data points to choose from), while the second method only uses the data points available in each training partition of the CV. A rough sketch of both options is below.
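For concreteness, here is a minimal sketch of the two options. The toy data frame dat, the outcome name Class, and the choice of DMwR::SMOTE as the stand-alone oversampler are illustrative assumptions, not code from the question; caret's built-in "smote" option and method = "rf" also assume the relevant backend packages (historically DMwR, plus randomForest) are installed.

    library(caret)

    set.seed(42)

    ## toy data with roughly a 1/10 class ratio, as in the question
    n   <- 1000
    dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    dat$Class <- factor(ifelse(runif(n) < 0.1, "rare", "common"))

    ## Option 1: balance the whole data set first, then cross-validate on the
    ## already-balanced data (stand-alone SMOTE, e.g. DMwR::SMOTE)
    balanced <- DMwR::SMOTE(Class ~ ., data = dat,
                            perc.over = 200, perc.under = 200)
    fit1 <- train(Class ~ ., data = balanced, method = "rf",
                  trControl = trainControl(method = "cv", number = 5))

    ## Option 2: cross-validate on the original data and let caret re-run
    ## SMOTE inside the training portion of every fold
    fit2 <- train(Class ~ ., data = dat, method = "rf",
                  trControl = trainControl(method = "cv", number = 5,
                                           sampling = "smote"))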

However, are there any benefits in balancing inside the CV?

Best Answer

The second method should be preferred, for exactly the reason you gave to justify the first: the first method uses the whole data set to synthesize new samples. Cross-validation excludes points from training in order to give an accurate assessment of the error rate on new data. Every synthetic sample SMOTE creates is an interpolation between real observations, so if you run SMOTE before splitting, synthetic points built partly from held-out observations end up in the training folds; information from the excluded points leaks into the training data and taints the cross-validation estimate.
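To make the fold-internal version concrete, here is a minimal hand-rolled sketch (not code from the answer): SMOTE is run only on the training part of each fold, so the held-out fold never contributes to any synthetic sample. It reuses the toy data frame dat from the sketch in the question and assumes DMwR::SMOTE and randomForest are available; plain accuracy is used only to keep the illustration short, not because it is a good metric for imbalanced data.

    library(caret)
    library(randomForest)

    set.seed(1)
    folds <- createFolds(dat$Class, k = 5)   # list of held-out index sets

    accuracy <- sapply(folds, function(test_idx) {
      train_part <- dat[-test_idx, ]         # fold-internal training data
      test_part  <- dat[test_idx, ]          # untouched held-out data

      ## oversample *after* the split, using only the training part
      train_bal <- DMwR::SMOTE(Class ~ ., data = train_part,
                               perc.over = 200, perc.under = 200)

      model <- randomForest(Class ~ ., data = train_bal)
      mean(predict(model, test_part) == test_part$Class)
    })

    mean(accuracy)

Setting sampling = "smote" in trainControl() has caret do this same bookkeeping inside each resample for you, which is why the second method is the safe default.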