I found this confusing when using the Neural Network Toolbox in MATLAB.
It divides the raw data set into three parts:
- training set
- validation set
- test set
I have noticed that in many training or learning algorithms, the data is divided into only two parts: the training set and the test set.
My questions are:
- What is the difference between a validation set and a test set?
- Is the validation set really specific to neural networks, or is it optional?
- To go further, is there a difference between validation and testing in the context of machine learning?
Best Answer
Typically, to perform supervised learning you need two types of data sets:
- In one dataset (your "gold standard"), you have the input data together with the correct/expected output. This dataset is usually prepared either by humans or by collecting data in a semi-automated way. You must have the expected output for every data row here, because supervised learning requires it.
- The data you are going to apply your model to. In many cases, this is the data for which you want your model's output, so you don't have any "expected" output here yet.
While performing machine learning, you do the following:
The validation phase is often split into two parts:
Hence the 50/25/25 split into training, validation, and test sets.
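To make the three-way split concrete, here is a minimal Python sketch, not the MATLAB toolbox's own procedure. The function names (`split_three_way`, `fit_constant`, `fit_line`), the toy data, and the two candidate models are all made up for illustration. It trains each candidate on the training set, picks the winner on the validation set, and only then measures the winner on the test set.

```python
import random

def split_three_way(rows, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle rows and cut them into training, validation, and test subsets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no row is dropped
    return train, val, test

def fit_constant(rows):
    """Hypothetical candidate 1: always predict the mean training output."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    """Hypothetical candidate 2: least-squares straight line."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mean_squared_error(model, rows):
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Made-up "gold standard": (input, expected output) pairs.
labelled = [(x, 2 * x + 1) for x in range(100)]
train, val, test = split_three_way(labelled)
print(len(train), len(val), len(test))  # 50 25 25

# Train every candidate on the training set only.
models = {"constant": fit_constant(train), "line": fit_line(train)}

# Validation: choose the best candidate using data it was not trained on.
best_name = min(models, key=lambda name: mean_squared_error(models[name], val))

# Test: estimate the chosen model's error on data used for neither step.
test_error = mean_squared_error(models[best_name], test)
print(best_name, test_error)
```

The test set is touched exactly once, after the choice is made, so the reported error is not biased by the selection step.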
If you don't need to choose an appropriate model from several rival approaches, you can re-partition your data so that you have only a training set and a test set, and skip validating the trained model. In that case I personally use a 70/30 split.
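For the two-way case, a sketch under the same caveats (the helper name `split_train_test` and the toy data are assumptions for illustration, not part of any toolbox) could look like this:

```python
import random

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle rows and cut them into a training and a test subset."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Made-up (input, expected output) pairs.
labelled = [(x, x % 2) for x in range(10)]
train, test = split_train_test(labelled)
print(len(train), len(test))  # 7 3
```

With no model selection step there is nothing to validate, so all labelled data not used for training can go to the final error estimate.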
See also this question.