Solved – How to choose training and test sets

cross-validation, validation

I would like to propose a single model (a decision tree), which is highly variable, and validate it. I chose its parameters after obtaining good quality measures with cross-validation.

I could build the model on the whole data set and report the cross-validated measures. But that way I can't get a particular graph (called a reliability plot) that is specific to the model. I would have to split my data set into training and test sets to obtain that graph, and the model built on the training set is different from the one optimised on the whole data set.
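For reference, here is a minimal sketch of how such a reliability (calibration) plot could be produced from a 50/50 train/test split, assuming scikit-learn, a binary classification problem, and placeholder data and tree settings (the `make_classification` data and the `max_depth`/`min_samples_leaf` values are just illustrative, not my actual setup):

```python
# Sketch: reliability plot for a decision tree from a 50/50 train/test split.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)  # 50% training, 50% test

tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                              random_state=0).fit(X_train, y_train)
prob_test = tree.predict_proba(X_test)[:, 1]  # predicted P(class = 1)

# Bin the predicted probabilities and compare with observed frequencies.
frac_pos, mean_pred = calibration_curve(y_test, prob_test, n_bins=10)
plt.plot(mean_pred, frac_pos, marker="o", label="decision tree")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```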

Could I choose my training set (50% of the total) so that it yields the same model as the one built on the whole data set? Is there anything unwise or wrong with this method?

Thanks

Best Answer

If you have tuned the model parameters using cross-validation, then you won't get an unbiased estimate of performance without using some completely new data. Even if you re-cross-validate using a different partition of the data, or make a random test/training split from the data you have already used, this will still bias the performance evaluation.

Note the "cross-validated measures" you already have are a (possibly heavily) biased performance estimate if you have directly optimised it to choose the (hyper-) prarameters.

The thing to do would be to use nested cross-validation, where the outer cross-validation is used for performance estimation, and the model parameters are tuned independently in each fold via an "inner" cross-validation.
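A minimal sketch of nested cross-validation with scikit-learn, assuming a binary classification problem; the placeholder data, the parameter grid, and the fold counts are illustrative assumptions, not prescriptions:

```python
# Sketch: nested cross-validation for a tuned decision tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data

# Inner CV: tunes the hyper-parameters (here, tree depth and leaf size).
param_grid = {"max_depth": [2, 4, 6, None], "min_samples_leaf": [1, 5, 20]}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
tuned_tree = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          param_grid, cv=inner_cv)

# Outer CV: estimates performance; tuning is repeated inside each outer fold,
# so the outer scores are not biased by the hyper-parameter search.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(tuned_tree, X, y, cv=outer_cv)
print("Nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Note that the outer scores estimate the performance of the whole procedure (tuning plus fitting), not of one specific parameter setting; the final model you report can then be tuned and fitted on all of the data.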