Machine Learning – Why Only Three Partitions (Training, Validation, Test)

Tags: data mining, machine learning, model selection

When you are trying to fit models to a large dataset, the common advice is to partition the data into three parts: the training, validation, and test sets.

This is because the models usually have three "levels" of parameters: the first "parameter" is the model class (e.g. SVM, neural network, random forest); the second level comprises the "regularization" parameters or "hyperparameters" (e.g. lasso penalty coefficient, choice of kernel, neural network structure); and the third level comprises what are usually just called the "parameters" (e.g. the coefficients on the covariates).

Given a model class and a choice of hyperparameters, one selects the parameters by choosing the parameters which minimize error on the training set. Given a model class, one tunes the hyperparameters by minimizing error on the validation set. One selects the model class by performance on the test set.
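The three-level scheme described above can be sketched on a toy problem. This is a minimal illustration, not a recommended pipeline: the model, the 1-D ridge penalty `lam`, and the helper names `fit_slope` and `mse` are all hypothetical choices made for the example.

```python
import random

random.seed(0)
# Toy data: y = 2x + Gaussian noise.
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in [i / 10 for i in range(100)]]
random.shuffle(data)
train, val, test = data[:60], data[60:80], data[80:]

def fit_slope(points, lam):
    """Fit the slope w minimizing sum (y - w*x)^2 + lam * w^2 (closed form)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)

def mse(points, w):
    return sum((y - w * x) ** 2 for x, y in points) / len(points)

# Parameters (the slope) are fit on the training set; the hyperparameter
# (the penalty lam) is tuned on the validation set; the test set is held
# out until the very end for a single, untouched error estimate.
grid = [0.0, 0.1, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: mse(val, fit_slope(train, lam)))
w = fit_slope(train, best_lam)
print("chosen lam:", best_lam, "test MSE:", round(mse(test, w), 3))
```

The same pattern extends to comparing model classes: each class gets its own tuning loop, and only the overall winner is ever scored on the test set.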

But why not more partitions? Often one can split the hyperparameters into two groups, and use a "validation 1" set to tune the first group and a "validation 2" set to tune the second. One could even treat the size of the training/validation split itself as a hyperparameter to be tuned.

Is this already a common practice in some applications? Is there any theoretical work on the optimal partitioning of data?

Best Answer

First, I think you're mistaken about what the three partitions do. You don't make any choices based on the test data. Your algorithms adjust their parameters based on the training data. You then run them on the validation data to compare your algorithms (and their trained parameters) and decide on a winner. You then run the winner on your test data to give you a forecast of how well it will do in the real world.

You don't validate on the training data because that would overfit your models. You don't stop at the winning score from the validation step because you have been iteratively adjusting things to win at that step, so you need an independent test (one you haven't been tuning toward) to estimate how well you will do outside the current arena.
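The optimism of the validation winner's score can be demonstrated directly. In this hedged sketch, the "models" are just fixed random label guesses on coin-flip data, so no model can genuinely beat 50% accuracy; yet the best validation score looks much better than 50%, while that same winner's test score does not.

```python
import random

random.seed(1)
n_models, n_val, n_test = 200, 50, 50

# Labels are pure coin flips: there is no signal to learn.
val_y = [random.randint(0, 1) for _ in range(n_val)]
test_y = [random.randint(0, 1) for _ in range(n_test)]

# Each "model" is a fixed random prediction for every example.
models = [[random.randint(0, 1) for _ in range(n_val + n_test)]
          for _ in range(n_models)]

def acc(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Pick the winner on validation data, then score it once on test data.
val_scores = [acc(m[:n_val], val_y) for m in models]
winner = max(range(n_models), key=lambda i: val_scores[i])
test_score = acc(models[winner][n_val:], test_y)
print("winner's validation accuracy:", val_scores[winner])
print("winner's test accuracy:", test_score)
```

The validation score of the winner is inflated simply because it was selected as the maximum over many comparisons; the test score, computed on data never used for selection, falls back toward chance.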

Second, I would think that one limiting factor here is how much data you have. Most of the time, we don't even want to split the data into fixed partitions at all, hence cross-validation (CV).
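The cross-validation alternative mentioned above can be sketched with k-fold CV on the same kind of toy ridge model (the model, grid, and helper names here are illustrative assumptions, not a prescribed method):

```python
import random

random.seed(2)
# Toy data: y = 2x + Gaussian noise.
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in [i / 10 for i in range(100)]]

def fit_slope(points, lam):
    """Fit the slope w minimizing sum (y - w*x)^2 + lam * w^2 (closed form)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)

def mse(points, w):
    return sum((y - w * x) ** 2 for x, y in points) / len(points)

def kfold_mse(points, lam, k=5):
    """Average held-out MSE over k folds: each fold is held out once
    while the model is fit on the remaining k-1 folds."""
    folds = [points[i::k] for i in range(k)]
    errs = []
    for i in range(k):
        held = folds[i]
        rest = [p for j, f in enumerate(folds) if j != i for p in f]
        errs.append(mse(held, fit_slope(rest, lam)))
    return sum(errs) / len(errs)

# Tune the hyperparameter by CV score instead of a fixed validation split.
grid = [0.0, 0.1, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: kfold_mse(data, lam))
print("chosen lam:", best_lam, "CV MSE:", round(kfold_mse(data, best_lam), 3))
```

Every point serves in a held-out role exactly once, which is why CV is preferred when data is scarce; a separate test set is still needed if you want an unbiased estimate after this selection.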