In k-fold cross validation, we divide the dataset into k folds, train the model on k-1 folds, and test it on the remaining fold, repeating until every fold has served as the test set. In each of these iterations we train a new model that is independent of the model from the previous iteration (every iteration uses a fresh instance of the model).

My question is: if I divide my dataset into train and test sets and use only the training set for the k-fold cross validation, and every iteration produces a new model, which output model from the cross validation should I evaluate on the test set (to compute the ROC curve, F1-score, precision, and so on)? After all, I end up with a different model for every iteration.

One way to implement k-fold cross validation is sklearn.model_selection.cross_val_score, which returns only an array of scores, one per cross-validation run. This confirms my problem: no model is returned that could be further evaluated on the test set. What should I do in this case?
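For example, something like this (a minimal sketch on a toy dataset, not my actual code) returns only the per-fold scores and no fitted model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# cross_val_score fits a fresh clone of the estimator in every fold
# and returns only an array of k scores, not a fitted model.
scores = cross_val_score(LogisticRegression(max_iter=5000), X_train, y_train, cv=5)
print(scores)         # one score per fold
print(scores.mean())  # average cross-validation score
```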
Confusion regarding K-fold Cross Validation
cross-validation, machine-learning, scikit-learn
Related Solutions
Cross validation with k folds means you have to split your data set into k disjoint groups. In your case, for 10 folds you split your data set into 10 disjoint groups of 400 samples each ($G_i$ with $i$ from 1 to 10). Usually the groups should have roughly the same size.
Now do the following:
- Train your classifier on $Train_1 = G_2 \cup G_3 \cup \dots \cup G_{10}$ and test it on $Test_1 = G_1$. Save the test results for later use.
- Train your classifier on $Train_2 = G_1 \cup G_3 \cup \dots \cup G_{10}$, test on $Test_2 = G_2$, and save the results for later use.
- Repeat for the remaining 8 folds and collect the results.
Now you know, for each instance of your dataset, how it was classified, since the union of all $Test_i$ is the original data set (each group $G_i$ is tested exactly once). You can then compute whatever error measures you like on these pooled predictions.
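A minimal sketch of this procedure, assuming a scikit-learn classifier and a toy dataset (both are placeholders for your own model and data):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# Out-of-fold predictions: every sample is predicted exactly once,
# by the model that did not see it during training.
oof_pred = np.empty_like(y)
for train_idx, test_idx in kf.split(X):
    clf = SVC()                        # fresh model per fold
    clf.fit(X[train_idx], y[train_idx])
    oof_pred[test_idx] = clf.predict(X[test_idx])

print(accuracy_score(y, oof_pred))     # error measured on the pooled predictions
```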
Now there are a couple of things I believe you should pay attention to. You said you have 20 target classes and 4000 samples. I do not know your specific problem, but that does not seem like plenty of data. So I believe it is better to run the cross validation several times and average the results; that way you decrease the chance of getting overly biased results.
Another thing to pay attention to is how you build your folds. You could use simple random sampling, but I believe it is better to use stratified sampling, which increases the chances of getting a usable CV estimate.
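Both suggestions (repeating the CV and stratifying the folds) are available in scikit-learn; a minimal sketch, again with a placeholder estimator and dataset:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Stratified folds preserve the class proportions in every fold;
# repeating the whole CV 5 times and averaging reduces the variance of the estimate.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv)
print(scores.mean(), scores.std())
```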
You might also consider bootstrap testing if you do not have enough instances for a 10-fold cross validation with stratified sampling.
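A rough sketch of the bootstrap idea (train on a resample drawn with replacement, evaluate on the out-of-bag instances), with the same placeholder classifier and dataset as above:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.utils import resample

X, y = load_digits(return_X_y=True)
rng = np.random.RandomState(0)

scores = []
for _ in range(50):
    # Sample n instances with replacement for training ...
    idx = resample(np.arange(len(y)), replace=True, random_state=rng)
    # ... and evaluate on the out-of-bag instances that were never drawn.
    oob = np.setdiff1d(np.arange(len(y)), idx)
    clf = SVC().fit(X[idx], y[idx])
    scores.append(accuracy_score(y[oob], clf.predict(X[oob])))

print(np.mean(scores))
```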
The two approaches are the same from a training perspective, as both use cross validation. If you use the same k and the data set is sufficiently large, there should be no difference.
The only difference is that in approach 2 you evaluate on an unseen 20% of the data.
In the second approach we use 80% of the data for training, split 60-20 in each fold, so each validation fold is 20/80 = 1/4 of the training data and k = 4.
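A minimal sketch of the second approach, with a placeholder estimator and dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out an unseen 20% for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 4-fold CV on the remaining 80%: each fold trains on 60% and validates on 20% of the full data.
cv_scores = cross_val_score(LogisticRegression(max_iter=5000), X_train, y_train, cv=4)
print(cv_scores.mean())

# Final model trained on all 80%, evaluated once on the held-out 20%.
final_model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(final_model.score(X_test, y_test))
```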
Best Answer
If you use K-fold cross validation (CV) for hyperparameter tuning, you should train a single model on the entire training set with the best hyperparameters found, and test it on the test set.
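A minimal sketch of that workflow using GridSearchCV (the estimator and parameter grid are just placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# K-fold CV on the training set only, to pick the hyperparameters.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# With refit=True (the default), GridSearchCV retrains a single model
# on the whole training set using the best hyperparameters.
print(search.best_params_)
print(search.score(X_test, y_test))  # evaluate that one model on the test set
```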
If you use K-fold CV for performance evaluation (as with sklearn's cross_val_score), then you don't need to split your dataset into train and test sets: the performance reported in each fold is already a test performance. People usually average the fold scores, or collect all the out-of-fold predictions and evaluate them over the entire dataset. This is typically done to assess performance when the dataset is small; there is no single output model in this case, nor is the aim to have one.
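A minimal sketch of both options (averaging the fold scores vs. pooling the out-of-fold predictions), with a placeholder estimator and dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Option 1: average the per-fold test scores.
print(cross_val_score(model, X, y, cv=5, scoring="f1").mean())

# Option 2: collect the out-of-fold predictions and evaluate them all at once.
oof_pred = cross_val_predict(model, X, y, cv=5)
print(f1_score(y, oof_pred))
```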