Solved – For bootstrapping, why does a higher subsample size lead to lower variance

bootstrap, resampling, sample-size

I've been working on a bootstrapping problem that's left me a little confused and wondering whether I'm doing things correctly.

We have around 200 samples from a population of about 3,400, and we want to use the bootstrap to estimate the total value for all 3,400 cases. A colleague and I took slightly different approaches.

I took the 200 samples and randomly selected 3,400 observations from them (with replacement) to create a new sample the same size as the population, then computed the sum of the 3,400 values. I repeated this 10,000 times and took the average and the standard deviation of all 10,000 totals. This gave me an estimate for the total value with a 95% confidence interval.

My colleague did almost exactly the same thing, but instead of drawing subsamples of 3,400, each of his 10,000 subsamples contained only 200 observations. He took the average and standard deviation of the 10,000 subsample means and multiplied them by 3,400 to get the estimate for the total.

When we compared results, we found we got the exact same answer for the estimate – which is good. However, the standard deviation from his method was much bigger.

From some research, it seems that his method of resampling to the same size as the original sample is the correct one, but can anyone explain exactly why the standard deviations differ?

The difference got me wondering whether this is the right way to calculate the standard deviation at all. Should we instead calculate the standard deviation within each of the 10,000 subsamples and then summarize those the same way we do the sum?

Also, can anyone point to any resources/tutorials to clear things up?

Thanks!

Best Answer

First off, a bootstrap resample should not be larger than your original sample. Regardless of the population size, if your sample has 200 observations, each resample should contain exactly 200 draws with replacement, no more and no fewer. So it's your colleague who has the correct results.

As for why your variance is lower: a random index array of size 3,400 follows a uniform distribution over the 200 original observations much more closely than a random index array of size 200 does. The more uniform the random indexes, the more each bootstrapped sample resembles the original sample. The bootstrapped means therefore land much closer to the original sample mean, which shrinks the spread of your results. Put in numbers, the standard deviation of a mean of m draws scales as 1/sqrt(m), so drawing 3,400 instead of 200 makes the standard deviation of the estimated total smaller by a factor of about sqrt(3400/200) ≈ 4.1, while leaving the point estimate unchanged.
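The effect can be checked with a small simulation. The following is only a sketch under stated assumptions: the 200 observations are made-up lognormal values (the real data are not given in the question), the number of replicates is cut from 10,000 to 2,000 to keep memory use modest, and the helper name `bootstrap_totals` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N_POP = 3400     # population size from the question
N_SAMPLE = 200   # observed sample size from the question
N_BOOT = 2000    # replicates (assumption: fewer than 10,000 to save memory)

# Assumption: stand-in data, since the real 200 values are not available.
sample = rng.lognormal(mean=3.0, sigma=1.0, size=N_SAMPLE)


def bootstrap_totals(data, draws, n_boot, scale, rng):
    """Draw `n_boot` resamples of size `draws` with replacement,
    sum each one, and rescale the sums to a population total."""
    idx = rng.integers(0, len(data), size=(n_boot, draws))
    return data[idx].sum(axis=1) * scale


# Asker's approach: 3,400 draws per resample, sum used directly.
totals_3400 = bootstrap_totals(sample, N_POP, N_BOOT, 1.0, rng)

# Colleague's approach: 200 draws per resample, scaled up by 3400/200.
totals_200 = bootstrap_totals(sample, N_SAMPLE, N_BOOT, N_POP / N_SAMPLE, rng)

sd_ratio = totals_200.std(ddof=1) / totals_3400.std(ddof=1)

print("mean total, 3400 draws:", totals_3400.mean())
print("mean total,  200 draws:", totals_200.mean())
print("SD ratio (200 / 3400): ", sd_ratio)
print("sqrt(3400 / 200):      ", np.sqrt(N_POP / N_SAMPLE))
```

Both approaches should give nearly the same mean total, while the ratio of the standard deviations should come out close to sqrt(17). Only the 200-draw version reflects the uncertainty that comes from having observed just 200 cases.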
