The standard deviation is as applicable here as anywhere else: it gives useful information about the dispersion of the data. In particular, the sd divided by the square root of the sample size is one standard error: it estimates the dispersion of the sampling distribution of the mean. Let's calculate:
$$3.2\% / \sqrt{10000} = 0.032\% = 0.00032.$$
That's tiny--far smaller than the $\pm 0.50\%$ precision you seek.
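As a quick check, the same arithmetic in R:

```r
# Standard error of the mean: sd divided by the square root of the sample size.
s <- 0.032        # sample sd (3.2%)
n <- 10^4         # sample size
se <- s / sqrt(n)
se                # 0.00032, i.e. 0.032%
```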
Although the data are not Normally distributed, the sample mean is extremely close to Normally distributed because the sample size is so large. Here, for instance, is a histogram of a sample with the same characteristics as yours and, at its right, the histogram of the means of a thousand additional samples from the same population.
It looks very close to Normal, doesn't it?
Thus, although it appears you are bootstrapping correctly, bootstrapping is not needed: a symmetric $(100 - \alpha)\%$ confidence interval for the mean is obtained, as usual, by multiplying the standard error by an appropriate percentile of the standard Normal distribution (to wit, $Z_{1-\alpha/200}$) and moving that distance to either side of the mean. In your case, $Z_{1-\alpha/200} = 2.5758$, so the $99\%$ confidence interval is
$$\left(0.977 - 2.5758(0.032) / \sqrt{10000},\ 0.977 + 2.5758(0.032) / \sqrt{10000}\right) \\ = \left(97.62\%, 97.78\%\right).$$
A sufficient sample size can be found by inverting this relationship. Here it tells us that you need a sample size of around
$$(3.2\% / (0.5\% / Z_{1-\alpha/200}))^2 \approx 272.$$
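In R, this inversion (solving $n \ge (\sigma\, Z_{1-\alpha/200} / m)^2$ for the desired margin of error $m = 0.5\%$) reads:

```r
sigma <- 0.032                      # sd (3.2%)
m     <- 0.005                      # desired margin of error (0.5%)
z     <- qnorm(1 - 0.01/2)          # Z_{1-alpha/200} = 2.5758 for a 99% CI
n     <- ceiling((sigma * z / m)^2)
n                                   # 272
```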
This is small enough that we might want to re-check the conclusion that the sampling distribution of the mean is Normal. I drew a sample of $272$ from my population and bootstrapped its mean (for $9999$ iterations):
Sure enough, it looks Normal. In fact, the bootstrapped confidence interval of $(97.16\%, 98.21\%)$ is almost identical to the Normal-theory CI of $(97.19\%, 98.24\%)$.
As these examples show, it is the absolute sample size that determines the accuracy of estimates, not the proportion of the population that is sampled. (An extreme but intuitive example is that a single drop of seawater can provide an accurate estimate of the concentration of salt in the ocean, even though that drop is a tiny fraction of all the seawater.) For your stated purposes, obtaining a sample of $10000$ (which requires more than $36$ times as much work as a sample of $272$) is overkill.
`R` code to perform these analyses and plot these graphics follows. It samples from a population having a Beta distribution with a mean of $0.977$ and an SD of $0.032$.
```r
set.seed(17)
#
# Study a sample of 10,000.
#
Sample <- rbeta(10^4, 20.4626, 0.4817)
hist(Sample)
hist(replicate(10^3, mean(rbeta(10^4, 20.4626, 0.4817))),
     xlab="%", main="1000 Sample Means")
#
# Analyze a sample designed to achieve a CI of width 1%.
#
(n.sample <- ceiling((0.032 / (0.005 / qnorm(1-0.005)))^2))
Sample <- rbeta(n.sample, 20.4626, 0.4817)
cat(round(mean(Sample), 3), round(sd(Sample), 3))        # Sample statistics
se.mean <- sd(Sample) / sqrt(length(Sample))             # Standard error of the mean
cat("CL: ", round(mean(Sample) + qnorm(0.005)*c(1,-1)*se.mean, 5)) # Normal CI
#
# Compare the bootstrapped CI of this sample.
#
Bootstrapped.means <- replicate(9999, mean(sample(Sample, length(Sample), replace=TRUE)))
hist(Bootstrapped.means)
cat("Bootstrap CL:", round(quantile(Bootstrapped.means, c(0.005, 1-0.005)), 5))
```
Rather than representing a problem with the bootstrap, this feature is sometimes used to estimate the bias of your original estimator; see, for example, Chapter 10 of Bradley Efron and Robert Tibshirani (1993), *An Introduction to the Bootstrap*, Chapman & Hall/CRC.
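A minimal sketch of that bias estimate, using the sample mean as the estimator and the same Beta population as above (the variable names here are illustrative, not from the reference):

```r
set.seed(17)
x <- rbeta(272, 20.4626, 0.4817)    # a sample from the same Beta population
theta.hat <- mean(x)                # original estimate
# Bootstrap replicates of the estimator:
boot <- replicate(9999, mean(sample(x, length(x), replace = TRUE)))
bias.hat <- mean(boot) - theta.hat  # bootstrap estimate of the bias
theta.hat - bias.hat                # bias-corrected estimate
```

For the sample mean the estimated bias is essentially zero, as expected for an unbiased estimator; the technique becomes interesting for estimators (such as ratios) that are biased.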