Bootstrap – How to Get Sampling Variance with Finite Population

Tags: bootstrap, poststratification

This question is obviously similar to Suggestions for estimating variance in finite population with bootstrap?, but I'm asking about a different approach.

I have a (very finite) population of size $N = 832$. I have sampled $n = 400$ of these, a significant chunk of the entire population. I can estimate the variance of my estimator of the population mean in the traditional, theoretical way, which includes a finite population correction factor $(N-n)/N$.

I can also use the bootstrap in the usual way to estimate the variance of the mean, that is:

  1. Resample with replacement from my sample.

  2. Compute the mean for all bootstrap replications generated in step 1.

  3. Compute the variance of the approximate sampling distribution obtained in step 2.
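The three steps above can be sketched as follows; this is a minimal illustration using a synthetic normal sample as a stand-in for the real 400 observations, which are not given in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data standing in for the n = 400 real observations.
sample = rng.normal(loc=1.0, scale=1.0, size=400)

B = 2000  # number of bootstrap replications
boot_means = np.empty(B)
for b in range(B):
    # Step 1: resample with replacement from the sample.
    resample = rng.choice(sample, size=sample.size, replace=True)
    # Step 2: compute the mean of this replication.
    boot_means[b] = resample.mean()

# Step 3: variance of the approximate sampling distribution.
boot_var = boot_means.var(ddof=1)
```

For a sample like this, `boot_var` lands close to the textbook estimate $s^2/n$, with no finite population correction anywhere in sight.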

This yields the variance for an infinite population — but I would like to account for the fact that I have sampled about half of the population.

My initial instinct was to just multiply the bootstrap variance by the same finite population correction, but that seems a bit… iffy.

My second thought was to work closer to the fundamentals. What if I resample "the missing 432 values" from my sample, and then account for the fact that the variance contributed by the 400 values I already have is zero, since they are fixed?

If we call the sample mean $m_s$ and the bootstrapped mean $m_b$ then I would compute the estimated population mean as

$$m_s \frac{n}{N} + m_b \frac{N-n}{N}$$

Then, unless I'm mistaken, I can proceed sort of as if this was a stratified sample, and compute the variance as

$$\left(\frac{n}{N}\right)^2 s_s^2 + \left(\frac{N-n}{N}\right)^2 s_b^2$$

where, in this case, $s_s^2 = 0$ because it's describing only itself, i.e. the sample is treated as the full population of that stratum.

That means the variance of the estimated mean for the actual full population of 832 wouldn't be the full bootstrap variance, but rather $\left(\frac{N-n}{N}\right)^2$ of it.
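Numerically, the stratified combination above can be simulated directly; again, a synthetic normal sample stands in for the real data, which are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 832, 400
sample = rng.normal(loc=1.0, scale=1.0, size=n)  # stand-in for the real data

B = 2000
partial_means = np.empty(B)
for b in range(B):
    # Draw only the N - n "missing" observations from the sample.
    missing = rng.choice(sample, size=N - n, replace=True)
    # Stratified combination: the known sample (weight n/N, variance 0)
    # plus the imputed stratum (weight (N - n)/N).
    partial_means[b] = (n / N) * sample.mean() + ((N - n) / N) * missing.mean()

partial_var = partial_means.var(ddof=1)
# Only the imputed stratum varies, so this equals ((N - n)/N)^2 times the
# variance of the imputed-stratum mean; the s_s^2 term is identically zero.
```

For this sample the result is roughly $\left(\frac{N-n}{N}\right)^2 \cdot \frac{s^2}{N-n}$, dramatically smaller than the plain bootstrap variance of roughly $s^2/n$.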

Can I… do this? I'm mostly concerned about treating "the things that happened to end up in the sample" as a stratum after the fact. It sounds wrong.

What compounds the feeling of wrongness is that the original sample mean and the bootstrap mean aren't really independent, are they? So there would be at least some sort of covariance term in there, wouldn't there? On the other hand, the entire premise of using the bootstrap is that both my original sample and the bootstrap replications are independent draws from the same underlying distribution.


Update

I may have overcomplicated things a bit in my question. A different way to approach this "partial bootstrap" idea is to draw the $N-n$ missing observations as in the question, fill up the rest of the bootstrap replication with the $n$ known observations, and then proceed as if it were a regular bootstrap replication.

This yields the same result as the stratified design above, and the value of the standard error with this approach is not unreasonable. However, it still feels highly suspect to me.
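The equivalence between the two formulations is easy to verify for a single replication; as before, a synthetic sample is assumed in place of the real data.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 832, 400
sample = rng.normal(loc=1.0, scale=1.0, size=n)  # stand-in for the real data

# One "partial" replication: draw the missing values, then fill up
# the census with the n known observations.
missing = rng.choice(sample, size=N - n, replace=True)
census = np.concatenate([sample, missing])

# Its mean is exactly the stratified combination m_s * n/N + m_b * (N-n)/N,
# so both formulations yield the same replication-to-replication variance.
stratified = (n / N) * sample.mean() + ((N - n) / N) * missing.mean()
assert np.isclose(census.mean(), stratified)
```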

The conventional bootstrap is saying "if we repeated this sample many times, drawing from the distribution approximated by the initial sample, we would get values that vary this much."

What I'm proposing is closer to saying "if we did a census of the entire population, we would get this value… except HAHA we don't know what the values for this part of the population would be, but it could be… this, for example? or this? or this? How much does the result of that vary?"

That seems like a fundamentally sensible question in the same spirit, but I lack the rigour to prove it correct.

Best Answer

For lack of a better way to settle it, I tried this out myself with a simulation.

It does not work.

It wildly underestimates the size of the confidence interval for most of the usable range of sampling proportions.

In the following plot we look at the width of the 90 % confidence interval for the estimate of the population mean (the true population mean is 1.0), as a function of increasing sampling proportion.

[Plot: CI width as a function of sampling proportion]

The black line is the width of the theoretical confidence interval, based on the normal standard error with a finite population correction. The red line is the partial-bootstrap confidence interval I'm attempting. Both widen as the sampling proportion shrinks, which is right. But the bootstrap interval is clearly not even close when smaller and smaller proportions of the population are sampled.
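A simulation along these lines can be sketched as below. The population, the seed, and the choice of a normal distribution are all assumptions for illustration; the real simulation setup is not shown in the answer.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 832
# Hypothetical stand-in population with true mean 1.0.
population = rng.normal(loc=1.0, scale=1.0, size=N)
z = 1.645  # two-sided 90 % normal quantile

def ci_widths(n, B=1000):
    """Widths of the fpc-corrected normal CI and the partial-bootstrap CI."""
    sample = rng.choice(population, size=n, replace=False)
    s2 = sample.var(ddof=1)
    # Theoretical: normal standard error with finite population correction.
    se_theory = np.sqrt((1 - n / N) * s2 / n)
    # Partial bootstrap: impute the N - n missing values from the sample.
    means = np.empty(B)
    for b in range(B):
        missing = rng.choice(sample, size=N - n, replace=True)
        means[b] = (n * sample.mean() + (N - n) * missing.mean()) / N
    se_partial = means.std(ddof=1)
    return 2 * z * se_theory, 2 * z * se_partial

# At a ~10 % sampling proportion the partial-bootstrap interval
# comes out far narrower than the theoretical one.
w_theory, w_partial = ci_widths(n=83)
```

Sweeping `n` from small to large values and plotting both widths reproduces the qualitative picture described above: the two curves only agree as the sampling proportion approaches 1.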


I'm starting to suspect that the problem might be that the method I'm suggesting is bias-preserving in some sense. If I have a small sample (say 10 % of the population) and that happens to be biased, then if I draw the other 90 % of the population from that biased sample, the law of large numbers ensures that my population replication is almost guaranteed to be just as biased as my initial small sample.

The real bootstrap avoids this problem by only creating replications of the same, small, 10 % size. With samples that small, it's more likely that some of them contain only extreme values. In other words, the replication is not as guaranteed to share the bias of the original small sample.

If this is true, it's a neat design feature of the bootstrap that I had never considered: the size of the replication is picked so that the standard error has a chance of reflecting any bias of the sample.
