I'm trying to perform some classification analyses on a relatively small dataset (201 observations, 32 predictors). The data contain 8 classes with unbalanced sample sizes, ranging from 10 in the smallest class to 43 in the largest. With CART and random forests (RF) the classification accuracy is quite poor: ~50% for CART and ~65% for RF.
Out of interest, I sampled 383 observations with replacement from the original 201. The class sizes are still very unbalanced (20 in the smallest class, 77 in the largest), yet on this bootstrapped dataset the performance of both classifiers is much better: ~80% for CART and ~90% for RF. I have two questions about this:
1) Why does sampling with replacement increase the accuracy of predictive models when no "new" data is created, i.e. all the extra samples come from the same original dataset and the classes are still very unbalanced?
2) Is this a legitimate way to improve model performance, provided the procedure is explained and the results are compared against those from the original dataset?
There are a lot of questions on here about bootstrapping data, but I can't seem to find any related to classification accuracy in predictive models.
Here is an example using the iris dataset in R, where the random forest trained on the original dataset has an OOB error rate of 4% compared to 0.33% for the bootstrapped dataset.
library(randomForest)
data(iris)
# Draw a bootstrap sample of twice the original size (300 rows from 150)
set.seed(514)
iris.boot <- iris[sample(nrow(iris), size = nrow(iris) * 2, replace = TRUE), ]
# Random forest on the original data
iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500)
iris.rf
Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 500)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
# The same model fit to the bootstrapped data
irisboot.rf <- randomForest(Species ~ ., data = iris.boot, ntree = 500)
irisboot.rf
Call:
 randomForest(formula = Species ~ ., data = iris.boot, ntree = 500)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 0.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         95          0         0 0.000000000
versicolor      0        107         1 0.009259259
virginica       0          0        97 0.000000000
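Out of curiosity, I also checked how much duplication the bootstrap introduces, since I suspect it is related. With 300 draws from 150 rows, in expectation roughly 1 - e^-2 ≈ 86% of the rows in iris.boot have at least one exact copy elsewhere in the sample (a quick check, reusing iris.boot from above):

# Fraction of rows in iris.boot that also appear elsewhere in the sample
mean(duplicated(iris.boot) | duplicated(iris.boot, fromLast = TRUE))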
Response to comments
In my actual work I'm using 10 x 10 CV (10-fold cross-validation repeated 10 times) to tune models, in the hope of minimising overfitting and reducing any optimistic bias in the results. I'm also assessing model performance using Cohen's kappa and % agreement. Is this a better approach than using classification error by itself?
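For reference, my tuning setup looks roughly like the following (a minimal sketch with the caret package, using iris as a stand-in for my real data; the mtry grid is just an illustration):

library(caret)

# 10 x 10 CV: 10-fold cross-validation repeated 10 times,
# selecting the tuning parameter by Cohen's kappa
set.seed(514)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
rf.tuned <- train(Species ~ ., data = iris, method = "rf",
                  metric = "Kappa", trControl = ctrl,
                  tuneGrid = data.frame(mtry = 1:4))
rf.tuned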
Also, you have said that using the bootstrap the way I have will result in overfitting, which I believe should show up as poor predictive accuracy on new samples the model did not see during training. However, using the iris dataset again as an example, I get a kappa of 1 and 100% agreement on "new" samples, i.e. those that were not included in the bootstrapped dataset:
library(randomForest)
library(irr)
data(iris)
# Tag each row so observations left out of the bootstrap can be identified
iris$ObsNumber <- 1:150
set.seed(514)
iris.boot <- iris[sample(nrow(iris), size = nrow(iris) * 2, replace = TRUE), ]
# Validation set = original rows never drawn into the bootstrap sample
validation.set <- subset(iris, !(iris$ObsNumber %in% iris.boot$ObsNumber))
iris$ObsNumber <- NULL
iris.boot$ObsNumber <- NULL
validation.set$ObsNumber <- NULL
iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500)
bootiris.rf <- randomForest(Species ~ ., data = iris.boot, ntree = 500)
predictions <- predict(bootiris.rf, validation.set)
kappa2(data.frame(predictions, validation.set$Species))
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 19
   Raters = 2
    Kappa = 1

        z = 6.12
  p-value = 9.23e-10
agree(data.frame(predictions, validation.set$Species))
 Percentage agreement (Tolerance=0)

 Subjects = 19
   Raters = 2
  %-agree = 100
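(The small validation set is expected: with a bootstrap of size 2n drawn from n rows, each original row is missed with probability (1 - 1/n)^(2n) ≈ e^-2, so only about 20 of the 150 iris rows should be left out.)

# Expected number of original rows never drawn into a bootstrap of size 2n
150 * (1 - 1/150)^300   # about 20, consistent with the 19 subjects above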
Also, is there an easy way to implement the Efron-Gong bootstrap you mention with randomForest?
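My naive attempt would be something like the following, though I'm not sure it's right (the function names and the choice of B = 50 are my own, not from any reference implementation):

# Sketch of the Efron-Gong optimism bootstrap for classification error
err <- function(fit, dat) mean(predict(fit, dat) != dat$Species)

# Apparent (resubstitution) error of the model fit to the full data;
# for a random forest this is close to zero by construction
full.fit <- randomForest(Species ~ ., data = iris, ntree = 500)
apparent <- err(full.fit, iris)

# Average optimism over B bootstrap refits: how much worse each refit
# does on the original data than on its own bootstrap sample
B <- 50
optimism <- replicate(B, {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  fit  <- randomForest(Species ~ ., data = boot, ntree = 500)
  err(fit, iris) - err(fit, boot)
})

apparent + mean(optimism)   # optimism-corrected error estimate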
Best Answer
There are several issues: