Solved – Train/test split for a small dataset (classification tree/ random forest)

cart, classification, machine learning, random forest

I'm trying to solve a classification problem with a classification tree on a small dataset. Depending on the contents of a particular train/test split (80:20 proportion), the overall accuracy varies considerably between runs (I typically observe values around 40%, but values like 20% or 75% are not uncommon). This makes it hard to judge whether the tree's parameters (e.g. the independent variables, maximum tree depth, etc.) are chosen well.

Therefore I decided to run, say, 1000 such trees and average their accuracies to estimate the mean accuracy (i.e. the accuracy I would expect with large train and test subsets).
Is this a good approach?
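As a sketch of what you describe (repeated random 80:20 splits with the accuracies averaged), using scikit-learn's `ShuffleSplit` and the iris data as a hypothetical stand-in for your dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the small dataset

# 1000 independent random 80:20 splits; one tree is fit per split
cv = ShuffleSplit(n_splits=1000, test_size=0.2, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=cv
)

# the average over splits is the Monte Carlo cross-validation estimate
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

This is often called Monte Carlo (or repeated random sub-sampling) cross-validation; the spread of `scores` also tells you how unstable any single split is.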

I have also built a random forest in a similar way: I split the dataset 1000 times into different train/test subsets (again 80:20) and built 1000 trees. For each entry in my dataset I then found the ~200 trees that had not been trained on that particular entry and let them "vote" on its classification (majority wins).
I then calculated the accuracy of that forest as usual. This way the classification of each entry is fairly stable, so there is little variance in the forest's accuracy.
Again – is this a good approach?
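The voting scheme above can be sketched as follows (again with scikit-learn and iris as a hypothetical stand-in; with an 80:20 split each entry lands in the test set of roughly 200 of the 1000 trees, so only those trees vote on it):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the small dataset
n, n_classes = len(y), len(np.unique(y))

# votes[i, c] = number of trees NOT trained on entry i that predict class c for it
votes = np.zeros((n, n_classes), dtype=int)

cv = ShuffleSplit(n_splits=1000, test_size=0.2, random_state=0)
for train_idx, test_idx in cv.split(X):
    tree = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    pred = tree.predict(X[test_idx])
    votes[test_idx, pred] += 1  # each held-out entry collects one vote

# majority vote per entry, then the usual accuracy of the ensemble
ensemble_pred = votes.argmax(axis=1)
accuracy = (ensemble_pred == y).mean()
print(f"out-of-split ensemble accuracy: {accuracy:.3f}")
```

This is essentially the out-of-bag (OOB) estimate of a bagged ensemble, except that the trees are trained on disjoint 80% subsets rather than bootstrap resamples.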

Best Answer

For a small dataset, I think it would be better to use leave-one-out cross-validation (LOOCV). Say you have 1000 examples. Each time you hold out one example as the test case and train on the remaining 999, so eventually you run 1000 tests.
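A minimal sketch with scikit-learn's `LeaveOneOut` (iris used as a hypothetical stand-in, so there are 150 fits rather than 1000):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the small dataset

# one fit per example: each run holds out a single observation as the test set
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=LeaveOneOut())

# each score is 0 or 1 (one test example); their mean is the LOOCV accuracy
print(f"LOOCV accuracy: {scores.mean():.3f} over {len(scores)} fits")
```

Unlike repeated random splits, LOOCV is deterministic and uses nearly all the data for every fit, which matters most when the dataset is small.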