Predictive Models – Finding the Best Model to Predict Small Datasets

Tags: predictive-models, small-sample

This is my first prediction task. I have a (very) small dataset with 59 rows and 3 features. I hope these are not too few samples.

group  size  weight
1      180   85
2      190   98
3      195   110
2      188   90
...

I want to predict the feature "group" based on the features "size" and "weight". I have built a classification tree, a random forest, and a neural network. For this, I used 50 samples as the training set and 9 samples as the test set.

Classification Tree: 7/9 correct predictions -> success rate: 77.78%

library(tree)
treeModel <- tree(as.factor(group) ~ size + weight, data = data[1:50, ])        # fit on the 50 training rows
treePrediction <- predict(treeModel, newdata = data[51:59, ], type = 'class')   # predict the 9 test rows

Random Forest: 5/9 correct predictions -> success rate: 55.56%

library(randomForest)
forestModel <- randomForest(as.factor(group) ~ size + weight, data = data[1:50, ], mtry = 2)
forestPrediction <- predict(forestModel, newdata = data[51:59, ], type = 'class')

Neural Network: 8/9 correct predictions -> success rate: 88.89%

library(nnet)
nnModel <- nnet(as.factor(group) ~ size + weight, data = data[1:50, ], size = 15)
nnPrediction <- predict(nnModel, newdata = data[51:59, ], type = 'class')
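(For reference, each success rate above is just the fraction of correct test predictions; a minimal sketch, assuming the true labels of the 9 test rows are in data$group[51:59]:)

# e.g. for the tree: 7 of 9 correct -> 0.7778
mean(as.character(treePrediction) == as.character(data$group[51:59]))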
  • How can I evaluate which of these three models is best?
  • Why is the result of the random forest worse than that of a single tree?
  • How can I improve the results?

Best Answer

In that case (small sample), you should probably try leave-one-out cross-validation. Using a single small validation set implies a high variance in the estimate of your model's performance. Therefore, the results are not reliable, and the fact that the decision tree beats the random forest may be pure chance.

Leave-one-out cross-validation consists in leaving out one observation and training the model on the remaining $n-1$ observations. You then compare the predicted and the actual value of the held-out observation, and repeat this so that each observation is left out once.

Your performance will then be evaluated on all 59 observations, not on 9.
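For concreteness, here is a minimal LOOCV sketch in R for the tree model, assuming the full 59-row data frame is called data as in your question; the same loop works for the random forest and the neural network by swapping the model-fitting call:

library(tree)

n <- nrow(data)          # 59 observations
correct <- logical(n)

for (i in 1:n) {
  # train on all observations except row i
  fit <- tree(as.factor(group) ~ size + weight, data = data[-i, ])
  # predict the single held-out row
  pred <- predict(fit, newdata = data[i, , drop = FALSE], type = 'class')
  correct[i] <- as.character(pred) == as.character(data$group[i])
}

mean(correct)            # LOOCV accuracy over all 59 observations

Computing this LOOCV accuracy for all three models gives a far more stable basis for comparing them than a single 9-row test set.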
