Solved – Difference between training, test and holdout set data mining model building

machine learning, terminology, validation

What is the difference between training, test, and holdout sets?

I know these concepts; I just want to make sure I have understood them correctly.

The training set is the data we have now. We remove a subset from it, and that removed subset is called the holdout set.

We build models using the remaining data (what is left after removing the holdout set), and the holdout set is used to finalize the estimates of the tuning parameters (step 1).

Then we build a final model on the entire training set (including the holdout set), using the tuning parameter values obtained in step 1.

Test data is something we get in the future. We don't know the value of its Y (dependent) variable, and we predict it using our model.
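To make the workflow described above concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption chosen for illustration: the synthetic data, the one-parameter ridge model, the helper names `fit_ridge_1d` and `mse`, the candidate penalty values, and the 50-row holdout size. It only shows the order of operations (hold out, tune, refit on everything).

```python
import random

def fit_ridge_1d(rows, lam):
    """Fit y ~ w * x with an L2 penalty lam (hypothetical toy model); returns w."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + lam)

def mse(rows, w):
    """Mean squared error of the slope-only model on rows."""
    return sum((y - w * x) ** 2 for x, y in rows) / len(rows)

# Synthetic training data (assumption): y = 2x + Gaussian noise.
rng = random.Random(0)
training = []
for _ in range(200):
    x = rng.uniform(-3, 3)
    training.append((x, 2.0 * x + rng.gauss(0, 1)))

# Remove a subset from the training data; the removed part is the holdout set.
rng.shuffle(training)
holdout, remainder = training[:50], training[50:]

# Step 1: fit on the remainder, pick the tuning parameter by holdout error.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(
    candidates,
    key=lambda lam: mse(holdout, fit_ridge_1d(remainder, lam)),
)

# Step 2: refit on the entire training set (holdout included) with that value.
final_w = fit_ridge_1d(training, best_lam)
print(best_lam, final_w)
```

Note that the holdout error from step 1 has been used for tuning, so it is no longer an unbiased estimate of how the final model will do on future test data.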

Best Answer

Well, Hastie, Tibshirani, and Friedman, in their seminal The Elements of Statistical Learning (page 222), say to break the data into three sections:

  1. Training (50%)
  2. Validation (25%)
  3. Testing (25%)

Here the model is fit on the training set, prediction error is estimated on the validation set for model selection, and the test set is used to assess the generalization error of the final chosen model. The test set should be locked away until the model calibration process is finished; otherwise the true model error will be underestimated.
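A 50/25/25 split like the one above could be done along these lines. This is just a sketch with the standard library; the function name `three_way_split`, its parameters, and the fixed seed are my own choices for illustration, not something taken from the book.

```python
import random

def three_way_split(rows, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle rows and split them into (train, validation, test) lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly
    n = len(shuffled)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left goes to the test set
    return train, validation, test

train, validation, test = three_way_split(range(100))
print(len(train), len(validation), len(test))  # 50 25 25
```

The three parts are disjoint, so the test rows are never seen while fitting or selecting the model.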

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science+Business Media, Inc.
