Solved – When is bootstrapping helpful and used

bootstrap, computational-statistics, confidence-interval, inference, sampling

When is bootstrapping helpful and when should it be used?

I have watched several videos, so I understand what bootstrapping does (it samples with replacement from a single sample many times, creating a bootstrapped sampling distribution). If we are interested in means or a difference in means, I understand that a t-test can be used, and that bootstrapping should be used for more unusual measures like medians (e.g. median home price) or proportions (e.g. the proportion of voters who vote for a given party). To pinpoint why exactly bootstrapping is used in these cases: is it because a t-distribution does not accurately capture the distribution of the median home price or the proportion of voters? I have seen examples that use hypothesis tests and CIs for a proportion without bootstrapping, so I wanted to clarify my understanding of why and when to use it.

Best Answer

I will use your t-test example to try to demonstrate why we would use bootstrapping in some cases but not in others.

The reason we use t-tests to compare sample means is the central limit theorem. The CLT states that if our sample ($X_1, X_2, \ldots, X_n$) meets certain conditions, then the sample mean $\bar{x}$ will (approximately) follow a normal distribution. However, we don't know the variance of that normal distribution, so we have to estimate it. Accounting for that extra uncertainty means the standardized statistic $(\bar{x} - \mu)/(s/\sqrt{n})$ follows a t-distribution with $n-1$ degrees of freedom.
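To make this concrete, here is a minimal sketch of a t-based 95% confidence interval for a mean, using only the standard library. The data and the hard-coded critical value (t with 9 degrees of freedom) are illustrative assumptions, not from the post:

```python
import math
import random
import statistics

# Illustrative: 95% t-interval for the mean of a small sample.
random.seed(0)
sample = [random.gauss(50, 10) for _ in range(10)]

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)   # sample std dev (n-1 denominator)
t_crit = 2.262                 # t quantile for 95% CI with df = n-1 = 9
half_width = t_crit * s / math.sqrt(n)

ci = (xbar - half_width, xbar + half_width)
print(f"mean = {xbar:.2f}, 95% t-interval = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The key point is that $s$ replaces the unknown population standard deviation, and the t critical value (wider than the normal's 1.96) compensates for estimating it.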

However, the central limit theorem gives us no such guarantee about the distribution of the median. If our sample had $n=5$, the sampling distribution of the median would be the distribution of $x_{(3)}$, the third order statistic, which depends heavily on the distribution of the $X_i$s, which we may not know.

This is where bootstrapping comes in. We know that the sample median is a good estimator of the population median, but we have no real idea how variable its sampling distribution is (unless we know the distribution of the $X_i$s exactly). Bootstrapping is the easiest way to get an accurate estimate of the variability in the sampling distribution of essentially any function of the data, such as the median or any other statistic.
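As a sketch of that idea, here is a percentile bootstrap for a median, using a skewed, made-up "home price" sample (the data-generating choices and the number of resamples are assumptions for illustration):

```python
import random
import statistics

# Illustrative percentile bootstrap for the median of skewed data.
random.seed(42)
sample = [random.lognormvariate(12, 0.5) for _ in range(200)]  # fake "home prices"

B = 2000
boot_medians = []
for _ in range(B):
    # Resample WITH replacement, same size as the original sample.
    resample = random.choices(sample, k=len(sample))
    boot_medians.append(statistics.median(resample))

# Percentile interval: the 2.5% and 97.5% quantiles of the bootstrap medians.
boot_medians.sort()
lo = boot_medians[int(0.025 * B)]
hi = boot_medians[int(0.975 * B)]
print(f"sample median = {statistics.median(sample):.0f}")
print(f"95% bootstrap CI for the median = ({lo:.0f}, {hi:.0f})")
```

Nothing here depends on knowing the distribution of the $X_i$s; the spread of `boot_medians` is the estimate of the median's sampling variability.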

Bootstrapping is equally valid for the mean. However, we often prefer the t-distribution approach because bootstrapping is only an asymptotic method: the guarantee that it provides a good estimate of the variance holds only as the sample size goes to infinity. It often works well for smaller samples, but imagine what bootstrapping would do if your sample size were 1: your calculated statistic would always be the same, so your estimate of the variance would be 0. So clearly, somewhere between 1 and infinity, bootstrapping begins to give good results.
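The $n=1$ degeneracy can be checked directly. This sketch (helper name and sample sizes are mine, not the answer's) bootstraps the variance of the mean: with one observation every resample is identical, so the estimate is exactly 0, while with a reasonable sample it is positive:

```python
import random
import statistics

random.seed(1)

def bootstrap_variance(sample, stat, B=1000):
    """Bootstrap estimate of the variance of stat(sample)."""
    reps = [stat(random.choices(sample, k=len(sample))) for _ in range(B)]
    return statistics.pvariance(reps)

# n = 1: every resample is [7.3], so the estimated variance is exactly 0.
v_one = bootstrap_variance([7.3], statistics.mean)
print(v_one)  # 0.0

# n = 50: the estimate is positive (roughly s^2 / n for the mean).
sample = [random.gauss(0, 1) for _ in range(50)]
v_fifty = bootstrap_variance(sample, statistics.mean)
print(v_fifty)
```

This is why the asymptotic caveat matters: the bootstrap's accuracy improves with sample size, and at the extreme of $n=1$ it tells you nothing about sampling variability.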