Solved – Is Random Forest suitable for very small data sets

random forest, small-sample

I have a data set comprising 24 rows of monthly data. The features are GDP, airport arrivals, month, and a few others. The dependent variable is the number of visitors to a popular tourism destination. Would Random Forest be suitable for such a problem?

The data are not public, so I am unable to post a sample.

Best Answer

Random forest is basically bootstrap resampling plus training decision trees on the resulting samples, so the answer to your question needs to address both.

Bootstrap resampling is not a cure for small samples. If you have just twenty-four observations in your dataset, then each sample drawn with replacement from this data consists of no more than those same twenty-four distinct observations. Reshuffling the cases and leaving some of them out does not let you learn anything new about the underlying distribution. So a small sample is a problem for the bootstrap.
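To make the bootstrap point concrete, here is a minimal sketch (standard library only) that resamples 24 row indices with replacement many times and counts how many distinct rows each resample contains. The function name, the seed, and the number of repetitions are my own illustrative choices, not part of any random forest implementation. Analytically, the expected number of distinct rows is 24 · (1 − (23/24)^24), roughly 15, so a typical bootstrap sample sees fewer distinct observations than the original data, never more.

```python
import random


def distinct_in_bootstrap(n, rng):
    """Draw one bootstrap sample of size n (with replacement)
    and return how many distinct row indices it contains."""
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample))


n = 24                      # number of rows, as in the question
rng = random.Random(0)      # fixed seed, arbitrary choice for reproducibility
counts = [distinct_in_bootstrap(n, rng) for _ in range(10_000)]

mean_distinct = sum(counts) / len(counts)
expected = n * (1 - (1 - 1 / n) ** n)   # analytic expectation

print(f"simulated mean of distinct rows: {mean_distinct:.2f}")
print(f"analytic expectation:            {expected:.2f}")
print(f"largest count seen:              {max(counts)} (cannot exceed {n})")
```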

Decision trees are trained by splitting the data conditionally on the predictor variables, one variable at a time, to find the subsamples with the greatest discriminatory power. With only twenty-four cases, suppose you were lucky and every split was even in size: after two levels of splits you would end up with four groups of six cases, and after three levels, with eight groups of three. If you calculated conditional means on those subsamples (to predict continuous values in regression trees, or conditional probabilities in classification trees), you would base your conclusions on only those few cases! So the subsamples used to make the decisions would be even smaller than your original data.
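The even-split arithmetic above can be tabulated in a few lines. This is only a sketch of the best case, assuming every split halves its node exactly; real trees split unevenly, so some leaves would be even smaller. The helper name `rows_per_leaf` is hypothetical.

```python
def rows_per_leaf(n_rows, depth):
    """Rows in each leaf if every split halves its node exactly."""
    return n_rows / 2 ** depth


n_rows = 24
for depth in range(5):
    leaves = 2 ** depth
    per_leaf = rows_per_leaf(n_rows, depth)
    # The standard error of a leaf mean scales as 1/sqrt(rows in the leaf),
    # shown here relative to the per-row standard deviation.
    rel_se = per_leaf ** -0.5
    print(f"depth {depth}: {leaves:2d} leaves, {per_leaf:4.1f} rows each, "
          f"relative SE of leaf mean {rel_se:.2f}")
```

At depth 4 the even split would need 1.5 rows per leaf, which is to say the tree has run out of data.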

With small samples it is usually wise to use simple methods. Moreover, you can compensate for the small sample by using informative priors in a Bayesian setting (if you have any reasonable out-of-data knowledge about the problem), so you could consider using a tailor-made Bayesian model.
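As a rough illustration of how an informative prior helps, here is a minimal sketch of the textbook conjugate update for a normal mean with known noise variance. It is not a model for the tourism data: the observations, the prior, and the noise variance below are all made-up numbers chosen for illustration, and `normal_posterior` is a hypothetical helper.

```python
def normal_posterior(prior_mean, prior_var, data, noise_var):
    """Posterior mean and variance of a normal mean, given a normal prior
    and observations with known noise variance (conjugate update)."""
    n = len(data)
    sample_mean = sum(data) / n
    prior_precision = 1.0 / prior_var
    data_precision = n / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * sample_mean)
    return post_mean, post_var


# Made-up illustrative observations (NOT real visitor counts).
visitors = [102.0, 98.0, 110.0, 95.0, 105.0, 99.0]

# Assumed prior belief and assumed known noise variance.
post_mean, post_var = normal_posterior(
    prior_mean=100.0, prior_var=25.0, data=visitors, noise_var=36.0
)

print(f"posterior mean:     {post_mean:.2f}")
print(f"posterior variance: {post_var:.2f}")
print(f"data-only variance: {36.0 / len(visitors):.2f}")
```

The posterior variance is smaller than both the prior variance and the variance of the sample mean alone, which is the sense in which prior knowledge stands in for missing observations.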
