Cross-Validation – Does K-Fold Change the Model in Each Iteration?

cross-validation, random-forest

My basic understanding is that a machine learning model is specific to its training data. When we change the training data, the model also changes.

If my understanding is correct, then while performing k-fold cross-validation the training data changes in each of the k iterations, and so does the model. If the model changes each time, how can the accuracy across all k iterations be reliable?

Is it better to split the data into training, validation, and test sets (perform hyperparameter tuning using the training and validation sets, then evaluate on the test set) than to perform k-fold cross-validation?

Best Answer

My basic understanding is that a machine learning model is specific to its training data.

A model's parameters (weights) are determined by the training data. Depending on the size of your training set, small variations in the training data can have significant effects on the learned weights. Neural networks trained on small datasets can also fluctuate from run to run when the weights are randomly initialized before training.
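Here is a minimal sketch of that point, using scikit-learn and a synthetic dataset (the data and subset sizes are invented for illustration): two random forests with identical settings, fitted on different subsets of the same data, can disagree on some predictions purely because their training rows differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Two different random subsets of the same data.
rng = np.random.default_rng(0)
idx_a = rng.choice(200, size=100, replace=False)
idx_b = rng.choice(200, size=100, replace=False)

# Same algorithm, same hyperparameters, same internal random seed;
# only the training rows differ.
model_a = RandomForestClassifier(random_state=0).fit(X[idx_a], y[idx_a])
model_b = RandomForestClassifier(random_state=0).fit(X[idx_b], y[idx_b])

# The two fitted models are genuinely different objects.
disagree = (model_a.predict(X) != model_b.predict(X)).mean()
print(f"Fraction of points where the two models disagree: {disagree:.2%}")
```

Fixing `random_state` on both forests isolates the effect: any disagreement comes from the change in training data, not from the algorithm's internal randomness.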

If the model changes each time, how can the accuracy across all k iterations be reliable?

The parameter weights will probably change with each iteration of k-fold CV because the training data varies. This is expected and shouldn't discourage its use. One of the primary purposes of k-fold CV is to assess how well your model *structure* generalizes to unseen data, which is evaluated using the validation error on each fold. By structure, I mean the type of model used (linear, logistic, neural network, etc.) and/or the features used (using variables $x$, $y$, and $z$ versus just $x$ and $y$). We compare how each model structure generalizes to held-out data, which lets us determine whether certain structures are over-fitting the training data.
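A hedged sketch of what that comparison looks like in practice (the dataset and the particular feature split are my own illustrative assumptions): k-fold CV scores each structure on its held-out folds, so we compare distributions of fold scores rather than any single fitted model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)

# Structure A: all six features. Structure B: only the first two.
scores_all = cross_val_score(RandomForestClassifier(random_state=0),
                             X, y, cv=5)
scores_two = cross_val_score(RandomForestClassifier(random_state=0),
                             X[:, :2], y, cv=5)

# The fitted forest differs on every fold; what we compare is the mean
# validation accuracy (and its spread) of each *structure* across folds.
print(f"all features: {scores_all.mean():.3f} +/- {scores_all.std():.3f}")
print(f"two features: {scores_two.mean():.3f} +/- {scores_two.std():.3f}")
```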

Is it better to split the data into training, validation, and test sets than to perform k-fold cross-validation?

I'm not sure exactly what you're proposing here, but k-fold CV is itself a method of splitting your training data into training and validation sets; once you have settled on a model structure, you train the final model on the entirety of the training data (see the sketch below). Generally, the more data in the training set, the more representative a sample we have of the data we are trying to predict, and if that holds, the more generalizable the model should be.
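A minimal sketch of that workflow, under assumed data and an assumed hyperparameter grid (neither comes from the original post): hold out a test set, run k-fold CV on the training portion to choose a structure, then refit the chosen model on the entire training set before the one final test evaluation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Synthetic data; test_size and the grid below are illustrative choices.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 5-fold CV over a small hyperparameter grid, using only the training data.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"max_depth": [3, 5, None]}, cv=5)
search.fit(X_train, y_train)  # refit=True (the default) retrains the best
                              # model on all of X_train after CV

# The untouched test set gives a single, final generalization estimate.
print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```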

Creating a second level of testing may seem redundant; I'm not sure what you would gain beyond incremental testing/training of your model. I would defer to a more knowledgeable person on this.