Solved – Training set, test set and validation set

classification, cross-validation

For a data mining project I have to find the best classifier for my data. What I wonder is whether I have to divide my data set into a training set and a test set. I have to choose between 5 models: Naive Bayes, neural network, K-NN, decision tree, and logistic regression. To know the performance of each model (by comparing their errors), I intended to run cross-validation and select the best one (using MSE). So my question is: why split my data at the beginning if I am assessing the performance of each of the 5 models with cross-validation? I hope I am clear enough.
Thanks in advance for your help.

Best Answer

So my question is: why split my data at the beginning if I am assessing the performance of each of the 5 models with cross-validation?

Every classifier has some parameter(s) to tune. You need to find the optimum parameters with 10-fold cross-validation on the training set, and only afterwards evaluate the model with those optimum parameters on the test set. This is what makes the initial split of the dataset important: the test set must play no part in the tuning. For finding the optimum parameters you can use 50% of the dataset; the remaining 50% is the test set on which you evaluate the tuned classifier.
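A minimal sketch of this workflow with scikit-learn, using one of the five models (K-NN) on synthetic data; the dataset, parameter grid, and split sizes are all illustrative placeholders, not part of the original question:

```python
# Split the data 50/50, tune K-NN's k with 10-fold CV on the training half,
# then evaluate the tuned model once on the untouched test half.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=400, random_state=0)

# 50% for tuning, 50% held out for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# 10-fold cross-validation over a small (illustrative) grid of k values.
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                      cv=10)
search.fit(X_train, y_train)

# The test half was never seen during tuning, so this score is unbiased.
print("best k:", search.best_params_["n_neighbors"])
print("test accuracy:", search.score(X_test, y_test))
```

The same pattern applies to each of the five candidate models: tune each one on the training half only, then compare their scores on the common test half.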

Additionally, after finding the optimum parameters on the 50% training data, you can take the complete dataset and evaluate the classifiers with 10-fold cross-validation. The performance on the 50% test set and the cross-validation performance should be close.
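This follow-up check might look like the sketch below, again with scikit-learn on placeholder data; the chosen parameter value (k = 5) is assumed here for illustration, standing in for whatever the tuning step actually selected:

```python
# After tuning has fixed the parameter, run 10-fold CV on the full dataset
# and compare the mean score with the held-out test-set score.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real dataset.
X, y = make_classification(n_samples=400, random_state=0)

# Suppose the tuning step on the training half selected k = 5.
model = KNeighborsClassifier(n_neighbors=5)

# 10-fold cross-validation over the complete data with the fixed parameter.
cv_scores = cross_val_score(model, X, y, cv=10)
print("10-fold CV accuracy on full data: %.3f" % cv_scores.mean())
```

If the mean cross-validation accuracy and the 50% test-set accuracy diverge a lot, that is a sign the split or the tuning leaked information.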