I'm trying to perform some classification analyses on a relatively small dataset (201 observations, 32 predictors). The data contain 8 classes with unbalanced sample sizes, ranging from 10 in the smallest class to 43 in the largest. With CART and random forests (RF) the classification accuracy is quite poor: ~50% for CART and ~65% for RF.
Out of interest, I sampled 383 observations with replacement from the original 201. The class sizes are still very unbalanced (20 in the smallest class, 77 in the largest), yet on this bootstrapped dataset the performance of both classifiers is much better: ~80% for CART and ~90% for RF. I have two questions about this:
1) Why does sampling with replacement increase the accuracy of predictive models when no "new" data is created, i.e. all the extra samples come from the same original dataset and the classes are still very unbalanced?
2) Is this a legitimate way to improve model performance, provided the procedure is explained and the results are compared against those from the original dataset?
There are a lot of questions on here about bootstrapping data, but I can't seem to find any related to classification accuracy in predictive models.
Here is an example using the iris dataset in R, where the random forest trained on the original dataset has an OOB error rate of 4% compared to 0.33% for the bootstrapped dataset.
library(randomForest)
data(iris)
# Draw a bootstrap sample of twice the original size (300 rows from 150)
set.seed(514)
iris.boot <- iris[sample(nrow(iris), size = nrow(iris) * 2, replace = TRUE), ]
# Random forest on the original data
iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500)
iris.rf
Call:
 randomForest(formula = Species ~ ., data = iris, ntree = 500)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06
# The same model fit to the bootstrapped data
irisboot.rf <- randomForest(Species ~ ., data = iris.boot, ntree = 500)
irisboot.rf
Call:
 randomForest(formula = Species ~ ., data = iris.boot, ntree = 500)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 0.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         95          0         0 0.000000000
versicolor      0        107         1 0.009259259
virginica       0          0        97 0.000000000
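Out of curiosity, I also checked how much duplication the bootstrap introduces, since I suspect it is related. With 300 draws from 150 rows, in expectation roughly 1 - e^-2 ≈ 86% of the rows in iris.boot have at least one exact copy elsewhere in the sample (a quick check, reusing iris.boot from above):

# Fraction of rows in iris.boot that also appear elsewhere in the sample
mean(duplicated(iris.boot) | duplicated(iris.boot, fromLast = TRUE))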
Response to comments
In my actual work I'm using 10 x 10 CV (10-fold cross-validation repeated 10 times) to tune models, in the hope of minimising overfitting and reducing any optimistic bias in the results. I'm also assessing model performance using Cohen's kappa and % agreement. Is this a better approach than using classification error by itself?
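For reference, my tuning setup looks roughly like the following (a minimal sketch with the caret package, using iris as a stand-in for my real data; the mtry grid is just an illustration):

library(caret)

# 10 x 10 CV: 10-fold cross-validation repeated 10 times,
# selecting the tuning parameter by Cohen's kappa
set.seed(514)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
rf.tuned <- train(Species ~ ., data = iris, method = "rf",
                  metric = "Kappa", trControl = ctrl,
                  tuneGrid = data.frame(mtry = 1:4))
rf.tuned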
Also, you have said that using the bootstrap the way I have will result in overfitting, which I believe should show up as poor predictive accuracy on new samples the model did not see during training. However, using the iris dataset again as an example, I get a kappa of 1 and 100% agreement on "new" samples, i.e. those that were not included in the bootstrapped dataset:
library(randomForest)
library(irr)
data(iris)
# Tag each row so observations left out of the bootstrap can be identified
iris$ObsNumber <- 1:150
set.seed(514)
iris.boot <- iris[sample(nrow(iris), size = nrow(iris) * 2, replace = TRUE), ]
# Validation set = original rows never drawn into the bootstrap sample
validation.set <- subset(iris, !(iris$ObsNumber %in% iris.boot$ObsNumber))
iris$ObsNumber <- NULL
iris.boot$ObsNumber <- NULL
validation.set$ObsNumber <- NULL
iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500)
bootiris.rf <- randomForest(Species ~ ., data = iris.boot, ntree = 500)
predictions <- predict(bootiris.rf, validation.set)
kappa2(data.frame(predictions, validation.set$Species))
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 19
   Raters = 2
    Kappa = 1

        z = 6.12
  p-value = 9.23e-10
agree(data.frame(predictions, validation.set$Species))
 Percentage agreement (Tolerance=0)

 Subjects = 19
   Raters = 2
  %-agree = 100
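(The small validation set is expected: with a bootstrap of size 2n drawn from n rows, each original row is missed with probability (1 - 1/n)^(2n) ≈ e^-2, so only about 20 of the 150 iris rows should be left out.)

# Expected number of original rows never drawn into a bootstrap of size 2n
150 * (1 - 1/150)^300   # about 20, consistent with the 19 subjects above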
Also, is there an easy way to implement the Efron-Gong bootstrap you mention with randomForest?
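My naive attempt would be something like the following, though I'm not sure it's right (the function names and the choice of B = 50 are my own, not from any reference implementation):

# Sketch of the Efron-Gong optimism bootstrap for classification error
err <- function(fit, dat) mean(predict(fit, dat) != dat$Species)

# Apparent (resubstitution) error of the model fit to the full data;
# for a random forest this is close to zero by construction
full.fit <- randomForest(Species ~ ., data = iris, ntree = 500)
apparent <- err(full.fit, iris)

# Average optimism over B bootstrap refits: how much worse each refit
# does on the original data than on its own bootstrap sample
B <- 50
optimism <- replicate(B, {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  fit  <- randomForest(Species ~ ., data = boot, ntree = 500)
  err(fit, iris) - err(fit, boot)
})

apparent + mean(optimism)   # optimism-corrected error estimate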
Best Answer
There are several issues: