Choosing the mean and std when using KFold cross-validation

cross-validation, machine-learning

With reference to this post on feature scaling, and many tutorials out there, it is often said that we should avoid data snooping by fitting the feature scaling (i.e., computing the mean and std) on the train set only, and then using that train-set mean and std to transform the test set.

I can understand this idea, but when extending it to, say, KFold cross-validation (K=5), how do we then determine the mean and std for our final test set?

I was thinking the following:

  1. Split X, y into two sets: X_train and X_test (note there is no separate validation set here, since we will be splitting X_train into 5 folds).
  2. X_train is further split into X1, X2, X3, X4, X5 (for simplicity, I dropped the train suffix).
  3. We will train the model five times; for example, train on X2–X5 and evaluate on X1, then train on X1–X4 and evaluate on X5, and so on.
  4. The confusion arises here: during training for each fold, we should perform feature scaling on only the 4 training folds, and use the mean and std from those 4 folds to transform the remaining validation fold. But this means we end up with 5 such mean/std pairs, since CV=5 (see the sketch after this list).
  5. How do we decide which mean and std to use for the final prediction on the test set? Logically, we would just choose the mean and std of the best-performing fold (out of the 5 folds)? This leads to the final question.
  6. Don't people typically average the 5 folds' predictions at the end when doing inference, so do we still take the best-performing fold's mean and std?
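
For concreteness, here is a minimal sketch (assuming scikit-learn and a synthetic dataset) of the per-fold scaling described in steps 3–4. Wrapping the scaler and the model in a `Pipeline` ensures that, within each CV split, the scaler is fit on the 4 training folds only and merely applied to the held-out fold:

```python
# Minimal sketch: per-fold scaling via a Pipeline (synthetic data for illustration).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Each of the 5 splits fits its own StandardScaler (its own mean/std) on the
# 4 training folds, producing the 5 mean/std pairs mentioned in step 4.
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(scores.mean())
```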

Best Answer

In the presence of a test set, k-fold CV is typically done for hyperparameter optimization. After that, you should retrain your model on the whole training set and estimate the mean/std from it to be used on the test set.
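
As a minimal sketch (continuing the hypothetical `pipe`/`X_train` names from the snippet in the question), the refit step looks like this:

```python
# After hyperparameter selection via CV, refit the scaler and model on the
# entire training set, so the mean/std applied to the test set come from all
# training data rather than from any single fold.
pipe.fit(X_train, y_train)                   # StandardScaler re-fit on the full training set
test_accuracy = pipe.score(X_test, y_test)   # test data scaled with the training mean/std
print(test_accuracy)
```

This way there is only one mean/std pair in the end; the 5 per-fold pairs exist only to get unbiased validation scores and are discarded once the hyperparameters are chosen.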