Regression – Train and Validation vs. Train, Test, and Validation in Machine Learning

cross-validation, machine learning, regression, train, validation

I am embarking on a new job that will give me the opportunity to do some cool machine learning stuff. I haven't touched this stuff on a deeper level since graduate school and I wanted to get some clarification on some concepts.

The way I was taught ML is that you split up your data (80/20) into training and validation datasets. You fit your model to the 80% training split and get an error rate, loss, etc. through cross-validation. Then, you take the fitted model you constructed with the training data and pop in the 20% validation dataset to check whether the error rates, loss, etc. are similar. If so, the model is good.

I have been doing some research to refresh my knowledge, and I've been noticing 3-way splits now (training/test/validation) where the split is usually (70/20/10). I'm so confused on how this 3-way split is different from the 2-way split I was taught in school. Also, I'm pretty sure I've been interchanging test with validation when referring to the 2-way split methodology.

Can someone verify if my understanding of the 2-way split is correct and explain the difference between that and the 3-way split?

Thank you!

Best Answer

In your two-way split, as you suspected, what you call the validation set is really your test set. Your description doesn't mention hyperparameter optimisation (HPO), but it's a key step for many machine learning algorithms. When you need HPO, you either hold out a separate validation set to tune the hyperparameters on, or tune them using cross-validation over the training set. In the end, the model is trained on the whole training dataset with the chosen hyperparameters and tested on the test set.
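To make the three-way workflow concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the synthetic data, the one-parameter ridge model, the candidate values for the penalty `lam`, and the helper names `fit_ridge`, `mse` and `unzip`. It also reads "the whole training dataset" as training plus validation rows, which is one common choice rather than the only one.

```python
import random

def fit_ridge(xs, ys, lam):
    """1-D ridge regression without intercept: w = sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def unzip(rows):
    """Turn a list of (x, y) pairs into two lists."""
    xs, ys = zip(*rows)
    return list(xs), list(ys)

# Synthetic data (illustrative assumption): y = 3x + Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(-5, 5)
    data.append((x, 3.0 * x + rng.gauss(0, 1)))
rng.shuffle(data)

# 70/10/20 split into training, validation and test sets.
train, val, test = data[:70], data[70:80], data[80:]

# Hyperparameter search: fit on train, score on validation only.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
x_tr, y_tr = unzip(train)
x_val, y_val = unzip(val)
val_errors = {lam: mse(fit_ridge(x_tr, y_tr, lam), x_val, y_val)
              for lam in candidates}
best_lam = min(val_errors, key=val_errors.get)

# Refit on train + validation with the chosen value, then score on the test set.
x_full, y_full = unzip(train + val)
w = fit_ridge(x_full, y_full, best_lam)
x_te, y_te = unzip(test)
test_mse = mse(w, x_te, y_te)
print("chosen lam:", best_lam, "test MSE:", test_mse)
```

The test rows play no part in choosing `lam`, which is what keeps the final number an honest estimate. Replacing the single validation split with cross-validation over the training rows would be the other option mentioned above.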

If your algorithm has no hyperparameters to optimise, you can estimate the loss by cross-validating over the training set as you did, but you could just as well cross-validate over the entire dataset: with five folds you get five 80/20 splits, and you average the loss across the folds. You don't need a second level of testing.
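The "five 80/20 splits" idea can be sketched as a hand-rolled 5-fold cross-validation over the full dataset. This is only an illustration under stated assumptions: the data are synthetic, the model is a simple least-squares line with no hyperparameters, and `fit_ols`, `mse` and `k_fold_losses` are hypothetical helper names.

```python
import random

def fit_ols(xs, ys):
    """Least-squares line y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_losses(data, k=5):
    """Hold out each of k contiguous folds in turn and return the k losses.

    If len(data) is not divisible by k, the leftover rows at the end are
    never held out and only ever appear in the training part.
    """
    fold_size = len(data) // k
    losses = []
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        held_out = data[start:stop]          # 20% used for scoring
        rest = data[:start] + data[stop:]    # 80% used for fitting
        xs, ys = zip(*rest)
        model = fit_ols(xs, ys)
        hx, hy = zip(*held_out)
        losses.append(mse(model, hx, hy))
    return losses

# Synthetic data (illustrative assumption): y = 2 + 3x + Gaussian noise.
rng = random.Random(1)
data = []
for _ in range(100):
    x = rng.uniform(-5, 5)
    data.append((x, 2.0 + 3.0 * x + rng.gauss(0, 1)))
rng.shuffle(data)

fold_losses = k_fold_losses(data, k=5)
cv_mse = sum(fold_losses) / len(fold_losses)
print("per-fold MSE:", fold_losses)
print("cross-validated MSE:", cv_mse)
```

Every row is used for scoring exactly once and for fitting four times, so no separate test set is set aside in this case.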