Solved – Is a validation set always necessary

feature selection, machine learning, model selection, validation

Let's say I did the following steps:

  1. Used some separate development set to select some features.
  2. Decided a priori to use only one learning algorithm (SVM) with only default parameter values.
  3. Trained a single model on a training set.
  4. Tested this model on the test set (see the sketch after this list).
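
For concreteness, a minimal sketch of these steps; the synthetic data, the univariate filter, and the split sizes are placeholders for illustration, not my actual setup:

```python
# Sketch of the pipeline in steps 1-4 (illustrative data and feature filter).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=50, n_informative=8,
                           random_state=0)

# 1. Separate development set, used only to choose the features.
X_dev, X_rest, y_dev, y_rest = train_test_split(X, y, train_size=0.3,
                                                random_state=0)
selector = SelectKBest(f_classif, k=10).fit(X_dev, y_dev)

# 2.-3. One learner chosen a priori (SVM, default parameters), trained once.
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest,
                                                    test_size=0.3,
                                                    random_state=0)
model = SVC().fit(selector.transform(X_train), y_train)

# 4. A single evaluation on the held-out test set; no validation set,
#    since no model or hyperparameter choice remains to be made.
test_acc = accuracy_score(y_test, model.predict(selector.transform(X_test)))
print("test accuracy:", test_acc)
```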

Is it OK that I didn't use a validation set, given that I had only one model a priori?

Is this acceptable in a scientific work?

Bear in mind that the purpose of my work was only to show that the feature selection was good, by showing that even a standard learning algorithm with its default parameter values can learn from these features and achieve good accuracy. I don't claim that I've found the best learning method for my problem (my work is on IR and only uses ML; it's not about ML).

Best Answer

Thinking of training/test/validation as involving different subsets of the data is not necessarily a good idea. First of all, it takes enormous samples to get precise accuracy estimates when splitting data. More precise estimates of the likely future performance of a predictive model can be had with the rigorous Efron-Gong "optimism" bootstrap, which uses the whole sample to develop the model and the whole sample to get a nearly unbiased estimate of future performance for observations from the same stream.
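
To make the idea concrete, here is a rough sketch of the optimism bootstrap, using accuracy as the performance index and a default SVM to match the question; the data and the number of resamples are arbitrary illustrations, not a prescription.

```python
# Sketch of the Efron-Gong "optimism" bootstrap (illustrative data/estimator).
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
estimator = SVC()

# Apparent performance: fit on the whole sample, evaluate on the same sample.
apparent = accuracy_score(y, estimator.fit(X, y).predict(X))

rng = np.random.default_rng(0)
B, optimism = 200, []
for _ in range(B):
    idx = rng.integers(0, len(y), len(y))                    # bootstrap resample
    boot = clone(estimator).fit(X[idx], y[idx])
    acc_boot = accuracy_score(y[idx], boot.predict(X[idx]))  # apparent accuracy on the resample
    acc_orig = accuracy_score(y, boot.predict(X))            # same model scored on the full sample
    optimism.append(acc_boot - acc_orig)

# Subtract the average optimism from the apparent accuracy to get a
# nearly unbiased estimate of future performance.
print("optimism-corrected accuracy:", apparent - np.mean(optimism))
```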

Note that even if you pre-specify a single model, validation may be needed if there are many parameters in the model.

Regarding acceptance in scientific work, see the end of Chapter 9 of Biostatistics for Biomedical Research at http://biostat.mc.vanderbilt.edu/ClinStat .