Solved – Difference between train/test and train/validate/test split

Tags: cross-validation, machine learning, model, train, validation

I know this question has been asked here before, but after reading the answers I still don't get the difference.

Consider, for instance, a lasso-penalized linear regression model. This model has a penalization parameter $\lambda$ that controls the level of shrinkage applied, so (in general) different $\lambda$ values generate different $\beta$ parameters. In this kind of situation, I am used to working with a train/test split: performing cross-validation over the training sample in order to find the penalization parameter that minimizes the prediction error, and, once I have found the optimal $\lambda$, computing the actual prediction error over the test split.
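To make the workflow I mean concrete, here is a minimal sketch of it, assuming scikit-learn and synthetic data in place of my real $X$ and $y$ (the variable names and grid of $\lambda$ values are illustrative only; scikit-learn calls the lasso penalty `alpha`):

```python
# Sketch: train/test split with cross-validation over lambda on the training sample only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Synthetic data standing in for the real problem.
X, y = make_regression(n_samples=500, n_features=50, noise=10.0, random_state=0)

# One split into train and test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Cross-validate over a grid of lambda values using only the training sample.
param_grid = {"alpha": np.logspace(-3, 1, 20)}
search = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

# The test split is touched only once, to report the final prediction error
# of the model refit with the selected lambda.
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(search.best_params_, test_mse)
```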

However, in several papers, people consider a train/validate/test split. I found the following description of this split in The Elements of Statistical Learning:

The training set is used to fit the models; the validation set is used
to estimate prediction error for model selection; the test set is used
for assessment of the generalization error of the final chosen model

So, as far as I can understand, what I usually do is the validation step (estimating the prediction error over different $\lambda$ values to find the optimal one) and then the test step (obtaining the error of the chosen model). But what am I supposed to do in the training step?

Here it says "fit the models". Does this mean using the training step to (roughly speaking) obtain the $\beta$ associated with each different $\lambda$? But, if so, what would be the difference between the error computed using the validation set and the test set? Neither of these sets would have been used in the fitting of the model, so both of them are independent of the training set. (A sketch of how I read the three-way procedure follows below.)
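Here is how I read the book's description, again as a hedged sketch with synthetic data and an illustrative $\lambda$ grid rather than my actual setup:

```python
# Sketch: explicit train/validate/test split for lasso, per the book's description.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=50, noise=10.0, random_state=0)

# 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Training step: fit one model (one set of betas) per lambda on the training set.
models = {alpha: Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
          for alpha in np.logspace(-3, 1, 20)}

# Validation step: estimate prediction error for each fitted model and select lambda.
val_mse = {alpha: mean_squared_error(y_val, m.predict(X_val)) for alpha, m in models.items()}
best_alpha = min(val_mse, key=val_mse.get)

# Test step: assess the generalization error of the single chosen model.
test_mse = mean_squared_error(y_test, models[best_alpha].predict(X_test))
print(best_alpha, val_mse[best_alpha], test_mse)
```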

Best Answer

I would like to add to @Vishal's answer.

The accuracy you get on the validation or test set is not the true accuracy, but an estimate of it, and as such it has an uncertainty.

Typically, you will choose the lasso model (trained on the training set) with $\lambda = \lambda_{best}$ giving the best accuracy (or F1 score, or whatever measure you prefer to rank your models) on the validation set (call it $M_{\lambda_{best}}$), because that is the model with probably the best accuracy. But the estimate of the accuracy you get on the validation set is probably an overestimate: the accuracy was likely so high because the validation set happened to contain cases that $M_{\lambda_{best}}$ classifies well, or better than the other models $M_\lambda$ with $\lambda \neq \lambda_{best}$ (a sort of selection bias). So yes, the accuracy estimate from the validation set is biased.

The accuracy estimate you obtain on the test set, by contrast, is unbiased, because the test set played no role in selecting $\lambda_{best}$.
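This selection bias is easy to see in a small simulation (my own illustration, not part of the original answer; the sample sizes and $\lambda$ grid are arbitrary). The best-over-$\lambda$ validation error of the winning model tends to be optimistic compared to the same model's error on a fresh test set:

```python
# Sketch: the minimum validation MSE over many lambdas underestimates the
# selected model's error; the held-out test MSE does not.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

gaps = []
for rep in range(50):
    X, y = make_regression(n_samples=300, n_features=40, noise=20.0, random_state=rep)
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.5, random_state=rep)
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=rep)

    # Pick the model with the lowest validation MSE.
    best_val, best_model = np.inf, None
    for alpha in np.logspace(-2, 1, 15):
        m = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
        v = mean_squared_error(y_val, m.predict(X_val))
        if v < best_val:
            best_val, best_model = v, m

    # Compare its validation MSE with its MSE on the untouched test set.
    test = mean_squared_error(y_test, best_model.predict(X_test))
    gaps.append(test - best_val)

# On average the gap tends to be positive: the winning validation score was optimistic.
print(np.mean(gaps))
```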
