Bootstrap – Why Bootstrapping is Not Done in the Following Manner

bootstrap, intuition

I'm under the impression that when you bootstrap, your final results are the original statistic from your sample data, and the standard errors from the bootstrapped trials. However, it seems more intuitive to take the mean statistic from all your trials, rather than just the statistic from the original trial. Is there some statistical intuition why it is one and not the other?

Also, I came across a use case where someone uses bootstrapping using the mean as the statistic. They did their sampling, took the mean of each trial, and used that to calculate the confidence interval around the mean. Is this ok? It seems like you could draw confidence intervals using the original data itself, and bootstrapping would artificially lower the standard errors. Again, is there some intuition I could use to understand why this is ok/not ok?

Best Answer

The idea of the bootstrap is to estimate the sampling distribution of your estimate without making actual assumptions about the distribution of your data.

You usually go for the sampling distribution when you are after estimates of the standard error and/or confidence intervals. Your point estimate, however, is fine: given your data set, and without knowing the distribution, the sample mean is still a very good guess at the central tendency of your data. Now, what about the standard error? The bootstrap is a good way of getting that estimate without imposing a probability distribution on the data.
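As a rough sketch of that idea in Python (the exponential data and the sample size here are placeholders, not anything from the question): keep the sample mean as the point estimate, and use resampling only for the standard error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: 50 draws from some unknown (here, skewed) distribution.
data = rng.exponential(scale=2.0, size=50)

# Point estimate: the plain sample mean -- this stays our answer.
theta_hat = data.mean()

# Bootstrap: resample with replacement and recompute the statistic each time.
n_boot = 5000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(n_boot)
])

# The bootstrap standard error is the spread of the resampled statistics.
se_boot = boot_means.std(ddof=1)

print(f"point estimate: {theta_hat:.3f}")
print(f"bootstrap SE:   {se_boot:.3f}")
```

For the mean, `se_boot` should land close to the textbook estimate `data.std(ddof=1) / sqrt(n)`; the payoff of the bootstrap is that the same recipe works for statistics with no such formula.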

More technically, when building a standard error for a generic statistic: if you knew that the sampling distribution of your estimate $\hat \theta$ is $F$, and you wanted to see how far you can be from its mean $\mu$ (the quantity $\hat \theta$ estimates), you could look at the difference from the mean of the sampling distribution, namely $\delta$, and make that the focus of your analysis instead of $\hat \theta$:

$$ \delta = \hat \theta - \mu $$

Now, since we know that $\hat \theta \sim F$, we know that $\delta$ is distributed as $F$ shifted by the constant $\mu$, a type of "standardization" as we do with the normal distribution. With that in mind, we can write an 80% confidence statement such that

$$ P_F(\delta_{.9} \le \hat \theta - \mu \le \delta_{.1} \mid \mu) = 0.8 \quad\Leftrightarrow\quad P_F(\hat \theta - \delta_{.9} \ge \mu \ge \hat \theta - \delta_{.1} \mid \mu) = 0.8 $$

So we just build the CI as $\left[\hat \theta - \delta_{.1}, \hat \theta - \delta_{.9} \right]$. Keep in mind that we don't know $F$, so we can't know $\delta_{.1}$ or $\delta_{.9}$. And we don't want to assume $F$ is normal and just read off the percentiles of a standard normal distribution either.

The bootstrap principle helps us estimate the sampling distribution $F$ by resampling our data. Our point estimate remains $\hat \theta$; there isn't anything wrong with it. But if I take a resample I can build $\hat \theta^*_1$, another resample gives $\hat \theta^*_2$, and another gives $\hat \theta^*_3$. I think you get the idea.

The set of estimates $\hat \theta^*_1, \dots, \hat \theta^*_n$ has an empirical distribution $F^*$ which approximates $F$. We can then compute $$ \delta^*_i = \hat \theta^*_i - \hat \theta $$

Notice that the unknown $\mu$ is replaced by our best guess for it, $\hat \theta$. We then use the empirical distribution of the $\delta^*_i$ to compute the interval $\left[\hat \theta - \delta^*_{.1}, \hat \theta - \delta^*_{.9} \right]$.
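The whole recipe can be sketched in a few lines of Python. This is the "basic" (pivotal) bootstrap interval; the data below are again a made-up placeholder, and the subscripts follow the answer's convention that $\delta_{.9}$ and $\delta_{.1}$ are the 10th and 90th percentiles of the shifts.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=50)  # hypothetical sample
theta_hat = data.mean()                     # point estimate, kept as-is

# Resample to get theta*_i, then the shifts delta*_i = theta*_i - theta_hat.
boot = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(5000)
])
delta_star = boot - theta_hat

# delta*_{.9} (10th percentile) and delta*_{.1} (90th percentile) of the shifts.
d_lo, d_hi = np.quantile(delta_star, [0.10, 0.90])

# Basic 80% CI: [theta_hat - delta*_{.1}, theta_hat - delta*_{.9}].
ci = (theta_hat - d_hi, theta_hat - d_lo)
print(f"80% basic bootstrap CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Note the interval is still centered on the original $\hat \theta$; the resamples only supply the quantiles of the shifts, which is exactly why the point estimate is never replaced by the mean of the bootstrap trials.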

Now, this explanation is heavily based on this MIT class on the bootstrap. I highly recommend you give it a read.