Solved – For bootstrapping, why does a higher subsample size lead to lower variance

bootstrap, resampling, sample-size

I've been working on a bootstrapping problem that's left me a little confused and wondering whether I'm doing things correctly.

We have a sample of around 200 cases from a population of about 3,400, and we want to use the bootstrap to estimate the total value across all 3,400 cases. A colleague and I took slightly different approaches.

I took the 200 sampled values and randomly selected 3,400 observations from them (with replacement) to create a new sample the same size as the population, then got the sum of the 3,400 values. I repeated this 10,000 times and took the average and the standard deviation of all 10,000 totals. This gave me an estimate for the total value with a 95% confidence interval.

My colleague did almost exactly the same thing, but instead of drawing resamples of 3,400, each of his 10,000 resamples had only 200 values in it. He took the average and standard deviation of the 10,000 resample means and multiplied both by 3,400 to get the estimate for the total.

When we compared results, we found we got the exact same answer for the estimate – which is good. However, the standard deviation from his method was much bigger.

From doing some research it seems like his method of resampling to the same number as the original sample is correct, but can anyone explain exactly why the standard deviations differ?

The difference got me wondering if this is how we calculate the standard deviation at all. Should we be calculating the standard deviation of each subsample 10,000 times and estimating that the same way as the sum?

Also, can anyone point to any resources/tutorials to clear things up?

Thanks!

Best Answer

First off, a bootstrap resample should not be larger than your original sample. So regardless of your population size, if your sample size is 200 you should not draw more than 200 values per resample. In fact you should draw precisely 200, with replacement. So it's your colleague who's got the correct results.

As for why your variance is lower, that's because random index arrays of size 3,400 are going to follow a uniform distribution over the 200 original observations more closely than random index arrays of size 200. The more uniform the random index distribution, the more the bootstrapped distribution is going to resemble the original sample distribution. This means the bootstrapped mean values are also going to be much closer to the original sample mean, which reduces the overall variance in your results. Quantitatively, the standard deviation of a resample mean scales as $1/\sqrt{n}$ in the resample size $n$, so resamples of 3,400 understate the spread by a factor of about $\sqrt{3400/200} \approx 4.1$.
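The effect can be checked with a small simulation. This is only a sketch: the 200 values are a hypothetical stand-in generated with `rgamma` (the real data are not shown in the question), and 2,000 resamples are used instead of 10,000 to keep it quick. Both schemes should centre on the same total, while the ratio of their standard deviations should sit near $\sqrt{200/3400} \approx 0.24$.

```r
set.seed(1)
# Hypothetical stand-in for the 200 observed values (right-skewed, arbitrary units).
x <- rgamma(200, shape = 2, scale = 50)
N.pop <- 3400   # population size from the question
B <- 2000       # resamples per scheme (the question used 10,000)

# Scheme A: draw 3,400 values with replacement and sum them.
totals.A <- replicate(B, sum(sample(x, N.pop, replace = TRUE)))
# Scheme B: draw 200 values with replacement, then scale the mean up to 3,400 cases.
totals.B <- replicate(B, N.pop * mean(sample(x, length(x), replace = TRUE)))

c(mean.A = mean(totals.A), mean.B = mean(totals.B))  # centres of the two schemes
sd.ratio <- sd(totals.A) / sd(totals.B)
c(observed = sd.ratio, theory = sqrt(length(x) / N.pop))
```

Only scheme B mimics what happened when the data were collected (200 draws from the population), which is why its spread is the one that reflects the uncertainty in the estimated total.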

Related Solutions

Bootstrap – Is Bootstrapping Appropriate for This Continuous Data?

The standard deviation is as applicable here as anywhere else: it gives useful information about the dispersion of the data. In particular, the sd divided by the square root of the sample size is one standard error: it estimates the dispersion of the sampling distribution of the mean. Let's calculate:

$$3.2\% / \sqrt{10000} = 0.032\% = 0.00032.$$

That's tiny--far smaller than the $\pm 0.50\%$ precision you seek.

Although the data are not Normally distributed, the sample mean is extremely close to Normally distributed because the sample size is so large. Here, for instance, is a histogram of a sample with the same characteristics as yours and, at its right, the histogram of the means of a thousand additional samples from the same population.

It looks very close to Normal, doesn't it?

Thus, although it appears you are bootstrapping correctly, bootstrapping is not needed: a symmetric $(100 - \alpha)\%$ confidence interval for the mean is obtained, as usual, by multiplying the standard error by an appropriate percentile of the standard Normal distribution (to wit, $Z_{1-\alpha/200}$) and moving that distance to either side of the mean. In your case, with $\alpha = 1$, $Z_{1-\alpha/200} = 2.5758$, so the $99\%$ confidence interval is

$$\left(0.977 - 2.5758(0.032) / \sqrt{10000},\ 0.977 + 2.5758(0.032) / \sqrt{10000}\right) \\ = \left(97.62\%, 97.78\%\right).$$
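As a quick arithmetic check, the same interval can be computed directly in R from the summary statistics quoted in this answer. The variable names are illustrative only, and `alpha` is expressed in percent to match the $Z_{1-\alpha/200}$ notation.

```r
m <- 0.977      # sample mean quoted in the answer
s <- 0.032      # sample standard deviation quoted in the answer
n <- 10000      # sample size
alpha <- 1      # in percent, so the interval has (100 - alpha)% = 99% coverage

z  <- qnorm(1 - alpha / 200)   # upper percentile of the standard Normal
se <- s / sqrt(n)              # standard error of the mean
ci <- m + c(-1, 1) * z * se    # lower and upper limits as proportions
round(100 * ci, 2)             # the same limits in percent
```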

A sufficient sample size can be found by inverting this relationship to solve for the sample size. Here it tells us that you need a sample size around

$$(3.2\% / (0.5\% / Z_{1-\alpha/200}))^2 \approx 272.$$

This is small enough that we might want to re-check the conclusion that the sampling distribution of the mean is Normal. I drew a sample of $272$ from my population and bootstrapped its mean (for $9999$ iterations):

Sure enough, it looks Normal. In fact, the bootstrapped confidence interval of $(97.16\%, 98.21\%)$ is almost identical to the Normal-theory CI of $(97.19\%, 98.24\%)$.

As these examples show, the absolute sample size determines the accuracy of estimates rather than the proportion of the population size. (An extreme but intuitive example is that a single drop of seawater can provide an accurate estimate of the concentration of salt in the ocean, even though that drop is such a tiny fraction of all the seawater.) For your stated purposes, obtaining a sample of $10000$ (which requires more than $36$ times as much work as a sample of $272$) is overkill.
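The seawater point can be illustrated with a rough simulation. As a sketch under stated assumptions, it builds two hypothetical finite populations from the same Beta distribution (mean $0.977$, SD $0.032$), one of $10^4$ units and one of $10^6$ units, and samples $272$ units without replacement from each. The population sizes and the number of repetitions are arbitrary choices made for illustration.

```r
set.seed(2)
n <- 272       # sample size derived in the answer
reps <- 1000   # number of repeated samples per population

# Two hypothetical populations with the same shape but very different sizes.
pop.small <- rbeta(1e4, 20.4626, 0.4817)   # the sample is about 2.7% of this population
pop.large <- rbeta(1e6, 20.4626, 0.4817)   # the sample is about 0.03% of this population

# Spread of the sample mean when drawing n units without replacement.
se.small <- sd(replicate(reps, mean(sample(pop.small, n))))
se.large <- sd(replicate(reps, mean(sample(pop.large, n))))
c(small = se.small, large = se.large, theory = 0.032 / sqrt(n))
```

The two standard errors should come out close to each other and to $0.032/\sqrt{272}$, even though the sampled fraction differs a hundredfold. The small remaining gap is the finite population correction $\sqrt{1 - n/N}$.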

R code to perform these analyses and plot these graphics follows. It samples from a population having a Beta distribution with a mean of $0.977$ and SD of $0.032$.

```r
set.seed(17)
#
# Study a sample of 10,000.
#
Sample <- rbeta(10^4, 20.4626, 0.4817)
hist(Sample)
hist(replicate(10^3, mean(rbeta(10^4, 20.4626, 0.4817))),xlab="%",main="1000 Sample Means")
#
# Analyze a sample designed to achieve a CI of width 1%.
#
(n.sample <- ceiling((0.032 / (0.005 / qnorm(1-0.005)))^2))
Sample <- rbeta(n.sample, 20.4626, 0.4817)
cat(round(mean(Sample), 3), round(sd(Sample), 3)) # Sample statistics
se.mean <- sd(Sample) / sqrt(length(Sample))      # Standard error of the mean
cat("CL: ", round(mean(Sample) + qnorm(0.005)*c(1,-1)*se.mean, 5)) # Normal CI
#
# Compare the bootstrapped CI of this sample.
#
Bootstrapped.means <- replicate(9999, mean(sample(Sample, length(Sample), replace=TRUE)))
hist(Bootstrapped.means)
cat("Bootstrap CL:", round(quantile(Bootstrapped.means, c(0.005, 1-0.005)), 5))
```

Bootstrap – How to Choose the Number of Bootstrap Resamples for Accurate Estimation

A bootstrap sample is usually taken to mean that the sample size of the resample is equal to the original sample size. What you are doing is to take resamples from the original sample with larger and larger (re)sample sizes. There is no reason to believe that this will represent the properties of the (original) sampling from the study population.

Say you are interested in the mean of some unknown distribution $F$ (on the real line, to make the example specific). The mean $\mu$ of the distribution $F$ (assuming it exists) is given by $$ \mu(F) = \int_{-\infty}^\infty x \; dF(x) $$ where the integral is a Stieltjes integral. If $F$ is the distribution of some continuous random variable with density $f(x) = F'(x)$, this is the usual integral $\int x f(x) \; dx$, but it also includes the discrete case. The point of writing the expectation in this unusual way is that we can see that the expectation is a functional of the distribution $F$, and also that it unifies the treatment of the continuous and discrete cases.

Now we get a sample $x_1, x_2, \dotsc, x_N$ from $F$, and the idea behind bootstrapping is that we represent the distribution $F$ with the sample, and investigate the sampling properties of estimators of $\mu$ by resampling from the sample. This makes clear that we need to assume that the sample is reasonably representative of $F$, so we cannot expect this to work well with samples that are too small.

Now, our sample size was $N$, so we want properties of estimators of $\mu$ based on a sample of size $N$. Suppose we take resamples of size $n$ (possibly with $n \neq N$). Our resamples are a stand-in for samples from $F$ (that is the whole point of bootstrapping!). Suppose $F$ also has finite variance $\sigma^2$, and we estimate $\mu$ by the empirical mean $$ \bar{x}=\frac{1}{N}\sum_i x_i=\int_{-\infty}^\infty x \;d\hat{F}_N(x) $$ where $\hat{F}_N(x)$ is the empirical distribution function at $x$. Then the variance of this estimator will be $\sigma^2/N$. Let's say we do resampling but with resamples of size $n$. Then the empirical mean based on these resamples will have variance $\sigma^2(\hat{F}_N)/n$, where $\sigma^2(\hat{F}_N)$ is the variance based on the sample. If this empirical variance is a good estimator of $\sigma^2$, this will be approximately $\sigma^2/n$. If $n$ is different from $N$, this cannot be a good representation of the variance of $\bar{x}$, so it will not tell you about the real uncertainty in $\bar{x}$ as an estimator of $\mu$.
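The step behind $\sigma^2(\hat{F}_N)/n$ can be spelled out as follows. This is a short sketch in which the starred symbols are notation introduced here for illustration: $x_1^*, \dotsc, x_n^*$ are independent draws from $\hat{F}_N$, $\bar{x}^*_n$ is their mean, and $\operatorname{Var}^*$ is the variance over the resampling with the original sample held fixed. Each draw takes the value $x_i$ with probability $1/N$, so

$$ \operatorname{Var}^*(x_j^*) = \frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})^2 = \sigma^2(\hat{F}_N), \qquad \operatorname{Var}^*(\bar{x}^*_n) = \operatorname{Var}^*\Big(\frac{1}{n}\sum_{j=1}^n x_j^*\Big) = \frac{\sigma^2(\hat{F}_N)}{n}. $$

Compared with the target $\sigma/\sqrt{N}$, the bootstrap standard deviation $\sigma(\hat{F}_N)/\sqrt{n}$ is therefore off by a factor of roughly $\sqrt{N/n}$: too small when $n > N$ and too large when $n < N$.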

EDIT

To clarify, the error in the results when using bootstrapping can be decomposed into the sampling error (due to only taking $N$ observations) and the bootstrap error (due to only taking a finite number of resamples). By increasing the number of resamples we can reduce the latter, but not the former.
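A rough way to see the two error sources separately is to repeat the whole bootstrap several times with few and with many resamples. The sketch below makes several assumptions for illustration: the data are a hypothetical `rexp` sample, `boot.se` is a helper defined here, and `B` denotes the number of resamples (each resample has the original size $N$).

```r
set.seed(3)
N <- 200
x <- rexp(N)   # hypothetical sample; any numeric data vector would do

# Bootstrap estimate of the standard error of the mean, from B resamples of size N.
boot.se <- function(x, B) {
  sd(replicate(B, mean(sample(x, length(x), replace = TRUE))))
}

# Repeat the whole bootstrap 20 times for a small and a large number of resamples.
se.B50   <- replicate(20, boot.se(x, 50))
se.B2000 <- replicate(20, boot.se(x, 2000))

# Run-to-run spread (the bootstrap error) shrinks as B grows ...
c(spread.B50 = sd(se.B50), spread.B2000 = sd(se.B2000))
# ... while the estimates settle near the plug-in value fixed by the N observations.
plug.in <- sqrt(mean((x - mean(x))^2) / N)
c(mean.B2000 = mean(se.B2000), plug.in = plug.in)
```

However large `B` becomes, the estimates converge to the plug-in value computed from the sample in hand, not to the true $\sigma/\sqrt{N}$. That remaining gap is the sampling error, and only more observations can reduce it.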

Sometimes one deliberately uses a bootstrap sample size different from the original. See "Can we use bootstrap samples that are smaller than original sample?" and "Subsample bootstrapping".
