There's nothing wrong with the (nested) algorithm presented; in fact, it would likely perform well, with decent robustness against the bias-variance problem, on different data sets. You never said, however, that the reader should assume the features you are using are "optimal", so if that's unknown, there are some feature selection issues that must be addressed first.

**FEATURE/PARAMETER SELECTION**

A less biased approach is to never let the classifier/model come anywhere near feature/parameter selection, since you don't want the fox (classifier, model) guarding the chickens (features, parameters). Your feature (parameter) selection method is a *wrapper* - feature selection is bundled inside the iterative learning performed by the classifier/model. Instead, I always use a feature *filter* that employs a different method, far removed from the classifier/model, in an attempt to minimize feature (parameter) selection bias. Look up wrapping vs. filtering and selection bias during feature selection (G.J. McLachlan).
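To make the wrapper/filter contrast concrete, here is a small sketch using scikit-learn; the synthetic data set and the choice of selectors are my own illustrative assumptions, not part of the answer above. The filter ranks features with a univariate statistic computed apart from the model, while the wrapper lets the classifier itself drive the selection:

```python
# Sketch: filter vs. wrapper feature selection (synthetic data, assumed setup)
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Filter: univariate F-test, independent of any classifier
filt = SelectKBest(f_classif, k=5).fit(X, y)

# Wrapper: recursive feature elimination driven by the classifier's weights
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

print("filter keeps:", sorted(filt.get_support(indices=True)))
print("wrapper keeps:", sorted(wrap.get_support(indices=True)))
```

The two methods need not agree on which features to keep; the point of the filter is that its choices cannot be tuned by the classifier's own fitting process.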

There is always a major feature selection problem, for which the solution is to invoke object partitioning (folds), in which the objects are partitioned into different sets. For example, simulate a data matrix with 100 rows and 100 columns, then simulate a binary variate (0,1) in another column -- call this the grouping variable. Next, run t-tests on each column using the binary (0,1) variable as the grouping variable. Several of the 100 t-tests will be significant by chance alone; however, as soon as you split the data matrix into two folds $\mathcal{D}_1$ and $\mathcal{D}_2$, each with $n=50$, the number of significant tests drops. Until you can solve this problem with your data by determining the optimal number of folds for parameter selection, your results may be suspect. So you'll need to establish some sort of bootstrap-bias method for evaluating predictive accuracy on the held-out objects as a function of the sample size used in each training fold, e.g., $\pi=0.1n, 0.2n, 0.3n, 0.4n, 0.5n$ (that is, increasing sample sizes used during learning), **combined** with a varying number of CV folds, e.g., 2, 5, 10, etc.
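The noise simulation above can be sketched in a few lines; the seed and the rule of requiring significance in both folds are my assumptions for illustration:

```python
# Sketch of the selection-bias simulation: 100 x 100 pure-noise matrix,
# binary grouping variable, per-column t-tests (assumed seed for repeatability)
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 100))   # 100 rows (objects), 100 feature columns
g = rng.integers(0, 2, size=100)      # binary (0,1) grouping variable

# t-test each column against the grouping variable on the full data
pvals = np.array([ttest_ind(X[g == 0, i], X[g == 1, i]).pvalue
                  for i in range(100)])
# With pure noise, roughly 5% of tests come out "significant" by chance.
print("significant at 0.05 (full data):", (pvals < 0.05).sum())

# Split into two folds of n = 50 and test within each fold; requiring
# significance in *both* folds typically leaves far fewer features.
half = np.arange(100) < 50
p1 = np.array([ttest_ind(X[half & (g == 0), i], X[half & (g == 1), i]).pvalue
               for i in range(100)])
p2 = np.array([ttest_ind(X[~half & (g == 0), i], X[~half & (g == 1), i]).pvalue
               for i in range(100)])
print("significant in both folds:", ((p1 < 0.05) & (p2 < 0.05)).sum())
```

This is the chance-significance effect the answer describes: features that look informative on the full noise matrix rarely replicate across independent folds.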

**OPTIMIZATION/MINIMIZATION**

You actually seem to be solving an optimization or minimization problem for function approximation, e.g., $y=f(x_1, x_2, \ldots, x_j)$, where, say, regression or a parametric predictive model is used and $y$ is continuously scaled. Given this, and given the need to minimize bias in your predictions (selection bias, bias-variance, information leakage from testing objects into training objects, etc.), you might look into employing CV together with swarm intelligence methods, such as particle swarm optimization (PSO), ant colony optimization, etc. PSO (see Kennedy & Eberhart, 1995) adds parameters for social and cultural information exchange among particles as they fly through the parameter space during learning. Once you become familiar with swarm intelligence methods, you'll see that you can overcome a lot of biases in parameter determination. Lastly, I don't know whether there is a random forest (RF; see Breiman, *Machine Learning*, 2001) approach for function approximation, but if there is, using RF for function approximation would alleviate 95% of the issues you are facing.
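For readers unfamiliar with PSO, here is a minimal sketch of the update rule described above (inertia plus cognitive and social pulls). The hyperparameters, bounds, and test function are illustrative assumptions, not the cited Kennedy & Eberhart implementation:

```python
# Minimal particle swarm minimizer (sketch; assumed hyperparameters)
import numpy as np

def pso(f, dim=2, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """w is the inertia weight; c1/c2 weight the cognitive (personal-best)
    and social (swarm-best) information exchange among particles."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))    # particle positions
    v = np.zeros_like(x)                          # particle velocities
    pbest, pbest_f = x.copy(), np.apply_along_axis(f, 1, x)
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        fx = np.apply_along_axis(f, 1, x)
        better = fx < pbest_f
        pbest[better], pbest_f[better] = x[better], fx[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest, pbest_f.min()

# Minimize the sphere function as a toy stand-in for a model's loss surface
best_x, best_f = pso(lambda p: np.sum(p ** 2))
```

In the CV setting the answer suggests, `f` would instead be a cross-validated loss evaluated at each particle's parameter vector.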

Cross-validation usually avoids the need for a separate validation set.

The basic idea with training/validation/test data sets is as follows:

Training: You try out different types of models with different choices of hyperparameters on the *training data* (e.g. linear model with different selection of features, neural net with different choices of layers, random forest with different values of mtry).

Validation: You compare the performance of the models from Step 1 on the *validation set* and select the winner. This helps to avoid wrong decisions caused by overfitting the training data set.

Test: You try out the winning model on the *test data* just to get a feel for how well it performs in reality. This reveals the overfitting introduced in Step 2. Here, you would not take any further decisions; it is just plain information.

Now, in the case where you replace the validation step by cross-validation, the data is handled almost identically, but you have only a training and a test data set. There is no need for a validation data set.

Training: See above.

Validation: You do cross-validation on the training data to choose the best model of Step 1 with respect to cross-validation performance (here, the original training data is repeatedly split into a temporary training set and a temporary validation set). The models fitted inside cross-validation are used only for choosing among the Step 1 candidates, which are themselves all computed on the full training set.

Test: See above.
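The train / CV / test workflow above can be sketched with scikit-learn; the candidate models, the `max_features` grid (the `mtry` analogue), and the split sizes are illustrative assumptions:

```python
# Sketch: model selection by CV on the training set, one final test-set score
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 1-2: compare candidate models via cross-validation on training data only
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "rf_small": RandomForestClassifier(max_features=2, random_state=0),
    "rf_large": RandomForestClassifier(max_features=8, random_state=0),
}
cv_scores = {name: cross_val_score(m, X_tr, y_tr, cv=5).mean()
             for name, m in candidates.items()}
best = max(cv_scores, key=cv_scores.get)

# Step 3: refit the winner on the full training set; score once on the test set
final = candidates[best].fit(X_tr, y_tr)
print(best, final.score(X_te, y_te))
```

Note that the test set is touched exactly once, after all decisions have been made, which is what keeps its score an honest estimate.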

## Best Answer

## My basic understanding is that the machine learning algorithms are specific to the training data.

The parameter weights of machine learning models are determined by the training data. Depending on the size of your training data, small variations in it can have significant effects on the weights of the resulting model. Neural networks trained on small training sets can also fluctuate because the weights are randomly initialized before training.

## Therefore, if the model is changed each time then how the accuracy in all k iterations is reliable?

The parameter weights will probably change with each iteration of k-fold CV due to the varying training data. This is expected, and shouldn't discourage use. One of the primary purposes of performing k-fold CV is to assess how well your model structure generalizes to unseen data, which can be evaluated using the validation error on each fold. By structure, I'm referring to the type of model used (linear, logistic, neural network, etc.) and/or the features used (using variables $x, y$ and $z$ versus just $x$ and $y$). We compare how each model structure generalizes, and we can determine whether certain structures are over-fitting the training data.
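Comparing model structures with k-fold CV can be sketched as follows; the synthetic data, the choice of a linear model, and the two feature sets ($x,y,z$ vs. $x,y$) are assumptions made for illustration:

```python
# Sketch: compare two model structures (feature sets) via k-fold CV error
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))   # columns stand in for variables x, y, z
target = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200)  # z is noise

def cv_mse(cols, k=5):
    """Mean validation MSE of a linear model fit on the given columns."""
    errs = []
    for tr, va in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[tr][:, cols], target[tr])
        pred = model.predict(X[va][:, cols])
        errs.append(np.mean((pred - target[va]) ** 2))
    return np.mean(errs)

print("x,y,z:", cv_mse([0, 1, 2]), "  x,y:", cv_mse([0, 1]))
```

The per-fold fits differ (different weights each time, as the answer notes), but the averaged validation error is a property of the *structure*, which is what we actually compare.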

## Is it better to split the model in training, validation, and testing than performing k-fold cross-validation?

I'm not sure exactly what you're proposing here, but k-fold CV is itself a method of splitting your training data into training and validation sets; once you have settled on a model structure, you train the model on the entirety of the training data. Generally, the more data in the training set, the more representative a sample we have of the data we are trying to predict, and the more generalizable the model should be.

Creating a second level of testing may be seen as redundant; I'm not sure what you would gain besides incremental testing/training of your model. I would defer to a more knowledgeable person on this.