Solved – If you use 10-fold cross-validation, which tree is representative?

cross-validation

If you use 10-fold cross-validation to estimate the error of, say, a C4.5 decision tree, you are essentially building 10 separate trees, each trained on 90% of the data and tested on the remaining 10%. Which one of the 10 trees is representative? Won't they all be different?

For example, how does WEKA give me a single C4.5 tree along with a cross-validation error? I feel I must have fundamentally misunderstood this.

Thanks for any help

Best Answer

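Typically, you use the 10 cross-validated trees only to estimate "out-of-sample" error, and then fit an 11th and final tree on the full dataset. None of the 10 fold trees is the representative one: they are discarded once their test errors have been averaged, and the tree you are shown is the one fit on all the data.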

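In theory, the error of the 11th tree on out-of-sample data should be similar to the out-of-sample error you estimated from the 10 cross-validated trees. If anything, the estimate tends to be slightly pessimistic, because each cross-validated tree was trained on only 90% of the data.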
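
To make the procedure concrete, here is a minimal sketch in plain Python. It is not C4.5 and not what WEKA runs internally: the learner is a one-split decision stump standing in for a full tree, and the toy dataset, the helper names (`fit_stump`, `error_rate`, `cross_validate`) and the seeds are all assumptions made for illustration. The point is only the workflow: k throwaway models produce the error estimate, and a separate final model is fit on everything.

```python
import random

def fit_stump(rows):
    """Fit a one-split 'tree' (decision stump) on (x, label) rows.

    Stand-in for a real tree learner such as C4.5: it picks the threshold
    and leaf labels that minimise training errors.
    """
    xs = sorted({x for x, _ in rows})
    # Candidate thresholds: midpoints between consecutive distinct x values.
    thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]
    best = None
    for t in thresholds:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in rows)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def error_rate(model, rows):
    """Fraction of rows the model misclassifies."""
    return sum(model(x) != y for x, y in rows) / len(rows)

def cross_validate(rows, k=10, seed=0):
    """Return (mean test error, per-fold test errors) over k folds."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    fold_errors = []
    for i in range(k):
        # Train on the other k-1 folds, test on the held-out fold.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit_stump(train)  # one of the k throwaway models
        fold_errors.append(error_rate(model, folds[i]))
    return sum(fold_errors) / k, fold_errors

# Hypothetical toy data: label is 1 when x > 0.5, with about 10% of labels flipped.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

cv_error, fold_errors = cross_validate(data, k=10)
final_model = fit_stump(data)  # the "11th" model, fit on all the data

print(f"cross-validated error estimate: {cv_error:.3f}")
print("per-fold errors:", [round(e, 2) for e in fold_errors])
```

The ten models built inside `cross_validate` never leave the function; only their averaged test error does. `final_model` is the one you would keep and inspect, and `cv_error` is the number you would quote as its expected error on new data.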