Solved – Train/test split for a small dataset (classification tree/ random forest)

cart, classification, machine learning, random forest

I'm trying to solve a classification problem with a classification tree on a small dataset. Depending on the contents of a particular train/test split (80:20 proportion), the overall accuracy varies considerably between runs (I typically observe values around 40%, but values like 20% or 75% are not uncommon). This makes it hard to judge whether the tree's parameters (e.g. the independent variables, maximum tree depth, etc.) are chosen well.

Therefore I decided to run, say, 1000 such trees and average their accuracies to estimate the mean accuracy (i.e. the accuracy I would expect with large train and test subsets).
Is this a good approach?
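As a sketch of what you describe (repeated random 80:20 splits with the accuracies averaged), using scikit-learn's `ShuffleSplit` and the iris data as a hypothetical stand-in for your dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the small dataset

# 1000 independent random 80:20 splits; one tree is fit per split
cv = ShuffleSplit(n_splits=1000, test_size=0.2, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=cv
)

# the average over splits is the Monte Carlo cross-validation estimate
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

This is often called Monte Carlo (or repeated random sub-sampling) cross-validation; the spread of `scores` also tells you how unstable any single split is.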

I have also built a random forest in a similar way: I split the dataset 1000 times into different train/test subsets (again 80:20) and built 1000 trees. For each entry in my dataset I then found the ~200 trees that had not been trained on that particular entry and let them "vote" on its classification (majority wins).
I then calculated the accuracy of that forest as usual. This way the classification of each entry is fairly stable, so there is little variance in the forest's accuracy.
Again – is this a good approach?
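The voting scheme above can be sketched as follows (again with scikit-learn and iris as a hypothetical stand-in; with an 80:20 split each entry lands in the test set of roughly 200 of the 1000 trees, so only those trees vote on it):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the small dataset
n, n_classes = len(y), len(np.unique(y))

# votes[i, c] = number of trees NOT trained on entry i that predict class c for it
votes = np.zeros((n, n_classes), dtype=int)

cv = ShuffleSplit(n_splits=1000, test_size=0.2, random_state=0)
for train_idx, test_idx in cv.split(X):
    tree = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    pred = tree.predict(X[test_idx])
    votes[test_idx, pred] += 1  # each held-out entry collects one vote

# majority vote per entry, then the usual accuracy of the ensemble
ensemble_pred = votes.argmax(axis=1)
accuracy = (ensemble_pred == y).mean()
print(f"out-of-split ensemble accuracy: {accuracy:.3f}")
```

This is essentially the out-of-bag (OOB) estimate of a bagged ensemble, except that the trees are trained on disjoint 80% subsets rather than bootstrap resamples.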

Best Answer

For a small dataset, I think it would be better to use leave-one-out cross-validation (LOOCV). Say you have 1000 examples. Each time you hold out one example as the test case and train on the remaining 999, so eventually you run 1000 tests.
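A minimal sketch with scikit-learn's `LeaveOneOut` (iris used as a hypothetical stand-in, so there are 150 fits rather than 1000):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for the small dataset

# one fit per example: each run holds out a single observation as the test set
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=LeaveOneOut())

# each score is 0 or 1 (one test example); their mean is the LOOCV accuracy
print(f"LOOCV accuracy: {scores.mean():.3f} over {len(scores)} fits")
```

Unlike repeated random splits, LOOCV is deterministic and uses nearly all the data for every fit, which matters most when the dataset is small.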