Bootstrap – How to Get Sampling Variance with Finite Population

Tags: bootstrap, poststratification

This question is obviously similar to Suggestions for estimating variance in finite population with bootstrap?, but I'm asking about a different approach.

I have a (very finite) population of size $N = 832$. I have sampled $n = 400$ of these, a significant chunk of the entire population. I can estimate the variance of my estimator of the population mean in the traditional, theoretical way, which includes a finite population correction factor $(N-n)/N$.

I can also use the bootstrap in the usual way to estimate the variance of the mean, that is:

  1. Resample with replacement from my sample.

  2. Compute the mean for all bootstrap replications generated in step 1.

  3. Compute the variance of the approximate sampling distribution obtained in step 2.
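The three steps above can be sketched as follows; this is a minimal illustration using a synthetic normal sample as a stand-in for the real 400 observations, which are not given in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data standing in for the n = 400 real observations.
sample = rng.normal(loc=1.0, scale=1.0, size=400)

B = 2000  # number of bootstrap replications
boot_means = np.empty(B)
for b in range(B):
    # Step 1: resample with replacement from the sample.
    resample = rng.choice(sample, size=sample.size, replace=True)
    # Step 2: compute the mean of this replication.
    boot_means[b] = resample.mean()

# Step 3: variance of the approximate sampling distribution.
boot_var = boot_means.var(ddof=1)
```

For a sample like this, `boot_var` lands close to the textbook estimate $s^2/n$, with no finite population correction anywhere in sight.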

This yields the variance for an infinite population — but I would like to account for the fact that I have sampled about half of the population.

My initial instinct was to just multiply the bootstrap variance by the same finite population correction, but that seems a bit… iffy.

My second thought was to work closer to the fundamentals. What if I resample "the missing 432 values" from my sample, and then account for the fact that the variance contributed by the 400 values I already have is zero, since they are fixed?

If we call the sample mean $m_s$ and the bootstrapped mean $m_b$ then I would compute the estimated population mean as

$$m_s \frac{n}{N} + m_b \frac{N-n}{N}$$

Then, unless I'm mistaken, I can proceed sort of as if this was a stratified sample, and compute the variance as

$$\left(\frac{n}{N}\right)^2 s_s^2 + \left(\frac{N-n}{N}\right)^2 s_b^2$$

where, in this case, $s_s^2 = 0$ because it's describing only itself, i.e. the sample is treated as the full population of that stratum.

That means the variance of the estimated mean for the actual full population of 832 wouldn't be the full bootstrap variance, but rather $\left(\frac{N-n}{N}\right)^2$ of it.
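Numerically, the stratified combination above can be simulated directly; again, a synthetic normal sample stands in for the real data, which are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 832, 400
sample = rng.normal(loc=1.0, scale=1.0, size=n)  # stand-in for the real data

B = 2000
partial_means = np.empty(B)
for b in range(B):
    # Draw only the N - n "missing" observations from the sample.
    missing = rng.choice(sample, size=N - n, replace=True)
    # Stratified combination: the known sample (weight n/N, variance 0)
    # plus the imputed stratum (weight (N - n)/N).
    partial_means[b] = (n / N) * sample.mean() + ((N - n) / N) * missing.mean()

partial_var = partial_means.var(ddof=1)
# Only the imputed stratum varies, so this equals ((N - n)/N)^2 times the
# variance of the imputed-stratum mean; the s_s^2 term is identically zero.
```

For this sample the result is roughly $\left(\frac{N-n}{N}\right)^2 \cdot \frac{s^2}{N-n}$, dramatically smaller than the plain bootstrap variance of roughly $s^2/n$.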

Can I… do this? I'm mostly concerned about treating "the things that happened to end up in the sample" as a stratum after the fact. It sounds wrong.

What compounds the feeling of wrongness is that the original sample mean and the bootstrap mean aren't really independent, are they? So there would be at least some sort of covariance term in there, wouldn't there? On the other hand, the entire premise of using the bootstrap is that both my original sample and the bootstrap replications are independent draws from the same underlying distribution.


Update

I may have overcomplicated things a bit in my question. A different way to approach this "partial bootstrap" idea is to draw the $N-n$ missing observations as in the question, fill up the rest of the bootstrap replication with the $n$ known observations, and then proceed as if it were a regular bootstrap replication.

This yields the same result as the stratified design above, and the value of the standard error with this approach is not unreasonable. However, it still feels highly suspect to me.
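The equivalence between the two formulations is easy to verify for a single replication; as before, a synthetic sample is assumed in place of the real data.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 832, 400
sample = rng.normal(loc=1.0, scale=1.0, size=n)  # stand-in for the real data

# One "partial" replication: draw the missing values, then fill up
# the census with the n known observations.
missing = rng.choice(sample, size=N - n, replace=True)
census = np.concatenate([sample, missing])

# Its mean is exactly the stratified combination m_s * n/N + m_b * (N-n)/N,
# so both formulations yield the same replication-to-replication variance.
stratified = (n / N) * sample.mean() + ((N - n) / N) * missing.mean()
assert np.isclose(census.mean(), stratified)
```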

The conventional bootstrap is saying "if we repeated this sample many times, drawing from the distribution approximated by the initial sample, we would get values that vary this much."

What I'm proposing is closer to saying "if we did a census of the entire population, we would get this value… except HAHA we don't know what the values for this part of the population would be, but it could be… this, for example? or this? or this? How much does the result of that vary?"

That seems like a fundamentally sensible question in the same spirit, but I lack the rigour to prove it correct.

Best Answer

For lack of a better way to settle it, I tried this out myself with a simulation.

It does not work.

It wildly underestimates the size of the confidence interval for most of the usable range of sampling proportions.

In the following plot we look at the width of the 90 % confidence interval for the estimate of the population mean (the true population mean is 1.0), as a function of increasing sampling proportion.

[Plot: CI width as a function of sampling proportion]

The black line is the width of the theoretical confidence interval, based on the normal standard error with a finite population correction. The red line is the partial-bootstrap confidence interval I'm attempting. Both widen as the sampling proportion shrinks, which is right. But the bootstrap interval is clearly not even close when smaller and smaller proportions of the population are sampled.
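A simulation along these lines can be sketched as below. The population, the seed, and the choice of a normal distribution are all assumptions for illustration; the real simulation setup is not shown in the answer.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 832
# Hypothetical stand-in population with true mean 1.0.
population = rng.normal(loc=1.0, scale=1.0, size=N)
z = 1.645  # two-sided 90 % normal quantile

def ci_widths(n, B=1000):
    """Widths of the fpc-corrected normal CI and the partial-bootstrap CI."""
    sample = rng.choice(population, size=n, replace=False)
    s2 = sample.var(ddof=1)
    # Theoretical: normal standard error with finite population correction.
    se_theory = np.sqrt((1 - n / N) * s2 / n)
    # Partial bootstrap: impute the N - n missing values from the sample.
    means = np.empty(B)
    for b in range(B):
        missing = rng.choice(sample, size=N - n, replace=True)
        means[b] = (n * sample.mean() + (N - n) * missing.mean()) / N
    se_partial = means.std(ddof=1)
    return 2 * z * se_theory, 2 * z * se_partial

# At a ~10 % sampling proportion the partial-bootstrap interval
# comes out far narrower than the theoretical one.
w_theory, w_partial = ci_widths(n=83)
```

Sweeping `n` from small to large values and plotting both widths reproduces the qualitative picture described above: the two curves only agree as the sampling proportion approaches 1.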


I'm starting to suspect that the problem might be that the method I'm suggesting is bias-preserving in some sense. If I have a small sample (say 10 % of the population) and that happens to be biased, then if I draw the other 90 % of the population from that biased sample, the law of large numbers ensures that my population replication is almost guaranteed to be just as biased as my initial small sample.

The real bootstrap avoids this problem by only creating replications of the same, small, 10 % size. With samples that small, it's more likely that some of them contain only extreme values. In other words, the replication is not as guaranteed to share the bias of the original small sample.

If this is true, it's a neat design feature of the bootstrap that I had never considered: the size of the replication is picked so that the standard error has a chance of reflecting any bias of the sample.
