Solved – Increasing sample size with bootstrap sampling

Tags: bootstrap, classification, predictive-models, sample-size, small-sample

I'm trying to perform some classification analyses with a relatively small dataset (201 observations, 32 predictors). There are 8 classes with unequal sample sizes, ranging from 10 in the least common class to 43 in the most common. Using CART and random forest (RF), classification accuracy is quite poor: ~50% for CART and ~65% for RF.

Out of interest, I sampled with replacement 383 observations from the original 201. The class sizes are still very unbalanced (20 in the least common class, 77 in the most common). I tested the bootstrapped dataset with CART and RF, and the performance of both classifiers is much better: ~80% for CART and ~90% for RF. I've got two questions related to this:

1) Why does sampling with replacement increase the accuracy of predictive models when no "new" data is created, i.e. all the extra samples come from the same original dataset and the classes are still very unbalanced?

2) Is this a legitimate way to improve model performance, provided it is explained and the results are compared with those from the original dataset?

There are a lot of questions on here about bootstrapping data, but I can't seem to find any related to classification accuracy in predictive models.

Here is an example using the iris dataset in R, where the original dataset gives an OOB error rate of 4% compared with 0.33% for the bootstrapped dataset.

library(randomForest)
data(iris)

# Bootstrap sample of twice the original size, drawing rows with replacement
set.seed(514)
iris.boot <- iris[sample(nrow(iris), size = nrow(iris) * 2, replace = TRUE), ]

# Random forest on the original data
iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500)
iris.rf

Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06


# Random forest on the bootstrapped data
irisboot.rf <- randomForest(Species ~ ., data = iris.boot, ntree = 500)
irisboot.rf

Call:
 randomForest(formula = Species ~ ., data = iris.boot, ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 0.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         95          0         0 0.000000000
versicolor      0        107         1 0.009259259
virginica       0          0        97 0.000000000

Response to comments

In my actual work I'm using 10 x 10 cross-validation to tune the models, in the hope of minimising overfitting and reducing any optimistic bias in the results, and I'm assessing model performance with kappa and % agreement. Is this a better approach than using classification error by itself?
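
Roughly, the tuning set-up looks like the following (a minimal sketch assuming the caret package; the seed, ntree value and use of iris here are only illustrative, not my actual data):

library(caret)
library(randomForest)
data(iris)

# 10 repeats of 10-fold cross-validation
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

set.seed(514)
rf.cv <- train(Species ~ ., data = iris, method = "rf",
               metric = "Kappa",        # tune on kappa rather than raw accuracy
               trControl = ctrl, ntree = 500)

rf.cv$results   # cross-validated Accuracy (i.e. % agreement) and Kappa for each mtry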

Also, you have said that using the bootstrap the way I have will result in overfitting, which I believe should show up as poor predictive accuracy on new samples not seen by the model during building. However, using the iris dataset again as an example, I get a kappa of 1 and % agreement of 100% on "new" samples, i.e. those that were not included in the bootstrap dataset:

library(randomForest)
library(irr)

data(iris)
# Tag each row so we can later identify which observations were drawn into the bootstrap sample
iris$ObsNumber <- 1:150

# Bootstrap sample of twice the original size
set.seed(514)
iris.boot <- iris[sample(nrow(iris), size = nrow(iris) * 2, replace = TRUE), ]

# Hold out the observations that were never drawn into the bootstrap sample
validation.set <- subset(iris, !(iris$ObsNumber %in% iris.boot$ObsNumber))

# Drop the row identifier before fitting
iris$ObsNumber <- NULL
iris.boot$ObsNumber <- NULL
validation.set$ObsNumber <- NULL

# Forests on the original and on the bootstrapped data
iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500)
bootiris.rf <- randomForest(Species ~ ., data = iris.boot, ntree = 500)

# Predict the held-out observations with the model fitted on the bootstrapped data
predictions <- predict(bootiris.rf, validation.set)

kappa2(data.frame(predictions, validation.set$Species))

 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 19 
   Raters = 2 
    Kappa = 1 

        z = 6.12 
  p-value = 9.23e-10 


agree(data.frame(predictions, validation.set$Species)) 


 Percentage agreement (Tolerance=0)

 Subjects = 19 
   Raters = 2 
  %-agree = 100 

Also, is there an easy way to implement the Efron-Gong bootstrap you mention with randomForest?

Best Answer

There are several issues:

  • The sample size is far too low to reliably do what you are attempting
  • Classification error is an improper scoring rule that is optimized by an incorrect model with incorrect features and incorrect weights
  • You are using the bootstrap incorrectly. Because the bootstrap samples with replacement, it duplicates observations, and those duplicates increase the amount of overfitting.
  • With the more appropriate Efron-Gong optimism bootstrap, used to estimate the drop-off in predictive performance and thereby obtain overfitting-corrected estimates of predictive accuracy, the philosophy is that one estimates the difference between the predictive accuracy of the fitted model evaluated on its own training data and the true, unknown predictive accuracy. The bootstrap can estimate this difference (the amount of overfitting) as the difference between super-overfitting (evaluating a model fitted on a bootstrap sample on that same bootstrap sample) and regular overfitting (evaluating the model fitted on the bootstrap sample on the original sample); see the sketch below.
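
To make that concrete, here is a minimal sketch of the optimism bootstrap wrapped around randomForest, using the iris data from the question. The helper accuracy(), the seed, and B = 200 are illustrative choices rather than anything prescribed above, and classification accuracy is used only because it is the metric in the question (the improper-scoring-rule caveat still applies).

library(randomForest)
data(iris)

# Proportion of correct classifications for a fitted forest on a given dataset
accuracy <- function(fit, newdata) {
  mean(predict(fit, newdata) == newdata$Species)
}

set.seed(514)
full.fit <- randomForest(Species ~ ., data = iris, ntree = 500)
apparent <- accuracy(full.fit, iris)   # apparent accuracy: model evaluated on its own training data

B <- 200
optimism <- replicate(B, {
  idx      <- sample(nrow(iris), replace = TRUE)
  boot.fit <- randomForest(Species ~ ., data = iris[idx, ], ntree = 500)
  # "super-overfitting" minus "regular overfitting"
  accuracy(boot.fit, iris[idx, ]) - accuracy(boot.fit, iris)
})

apparent - mean(optimism)   # overfitting-corrected estimate of accuracy

For models fitted with the rms package, validate() implements this optimism correction directly, but it does not cover randomForest, hence the explicit loop above.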