The fundamental “problem”: why do bootstrap intervals tend to be too short?

bootstrap

I found several posts that run simulations and demonstrate that bootstrap confidence intervals tend to be too short (even when the correct dependency/grouping structure is accounted for). This is asked repeatedly, and the answers point to the fact that the bootstrap is only asymptotically valid (one such simulation is sketched below).

E.g. this post and its discussion link to a lot of related questions:
Is bootstrap problematic in small samples?
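
Here is a minimal sketch of such a simulation (assumptions chosen only for illustration: Exp(1) data, the sample mean as the parameter of interest, the percentile bootstrap, $n = 10$). The empirical coverage of a nominal 95% interval comes out clearly below 95%:

```python
# Minimal coverage simulation: percentile bootstrap CI for the mean of Exp(1)
# data with n = 10. (Assumptions chosen only for illustration.)
import numpy as np

rng = np.random.default_rng(0)
n, n_datasets, n_boot = 10, 2000, 2000
true_mean = 1.0                                  # mean of Exp(1)

covered = 0
for _ in range(n_datasets):
    x = rng.exponential(scale=1.0, size=n)
    idx = rng.integers(0, n, size=(n_boot, n))   # bootstrap resample indices
    boot_means = x[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [0.025, 0.975])
    covered += int(lo <= true_mean <= hi)

print(f"empirical coverage: {covered / n_datasets:.3f} (nominal 0.95)")
# With n = 10 the empirical coverage is typically well below 0.95.
```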

What I did not find is a discussion of the why.

Is the fundamental reason the following: by replacing sampling from an (effectively infinite) population with resampling from a finite set (the observed sample), we are

  1. introducing artificial correlation between the resampled values (we draw with replacement from a fixed set), which was not present originally (the original observations were drawn independently at random);
  2. exchanging an infinite population for a finite one, so that an effect similar to the finite-population correction appears, $\operatorname{Var}\left( \frac1n \sum_i X_i \right) = \frac{1}{n}\left(1-\frac{n}{N}\right) \sigma^2$, which would be zero for $n=N$. The finite population (the sample from which we draw the resamples) simply behaves differently from the true infinite population. I cannot fully wrap my head around this; maybe someone sees a connection to the finite-population formula just quoted (a small numeric check of the related variance shrinkage appears after this list);
  3. ….

Or is it a mix of both 1) and 2)?
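
To make point 2 concrete (purely as an illustration of the intuition, not as a claim that it is the explanation): conditional on the observed sample, the variance of a bootstrap resample's mean is the plug-in value $\hat\sigma_n^2/n = \left(1-\frac1n\right) s^2/n$, i.e. deflated by the factor $(n-1)/n$ relative to the usual unbiased estimate $s^2/n$. A quick numeric check, assuming exponential data:

```python
# Check the identity Var*(mean of a resample) = sigma_hat^2 / n = (1 - 1/n) s^2 / n,
# where sigma_hat^2 uses the 1/n denominator and s^2 the 1/(n-1) denominator.
import numpy as np

rng = np.random.default_rng(1)
n, n_boot = 10, 200_000

x = rng.exponential(size=n)                      # one fixed observed sample
idx = rng.integers(0, n, size=(n_boot, n))
boot_means = x[idx].mean(axis=1)                 # means of bootstrap resamples

print("Monte Carlo variance of resampled means:", boot_means.var())
print("plug-in sigma_hat^2 / n                :", x.var() / n)
print("(1 - 1/n) * s^2 / n                    :", (1 - 1 / n) * x.var(ddof=1) / n)
```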

Maybe discussing this why question with the min/max statistic would help? (I know common workarounds like the $m$ out of $n$ bootstrap; I am not looking for adaptations of the vanilla bootstrap but for an answer to WHY the vanilla bootstrap fails in certain situations.)
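
To make the max example concrete: the bootstrap distribution of the sample maximum places probability $1-\left(1-\frac1n\right)^n \to 1-e^{-1} \approx 0.632$ on the observed maximum itself, however large $n$ is, while the true sampling distribution of the maximum (for continuous data) is continuous. A minimal sketch, assuming Uniform(0, 1) data:

```python
# The bootstrap distribution of the maximum is degenerate: a large, non-vanishing
# fraction of resamples reproduces the observed maximum exactly.
import numpy as np

rng = np.random.default_rng(2)
n, n_boot = 50, 100_000

x = rng.uniform(size=n)                          # Uniform(0, 1) sample
idx = rng.integers(0, n, size=(n_boot, n))
boot_max = x[idx].max(axis=1)                    # maximum of each resample

print("P*(max* == observed max), simulated:", np.mean(boot_max == x.max()))
print("theory: 1 - (1 - 1/n)^n            :", 1 - (1 - 1 / n) ** n)
```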

Best Answer

A correct confidence interval (CI) at level 95%, say, for a parameter defined within a probability model has the property that, if observations are indeed generated from that model with any given parameter value, the probability that the CI catches the true parameter is 95%.

In order to prove theoretically that a CI is correct, we therefore need to assume that we know that the data were generated from that model.

In nonparametric bootstrapping, we pretend instead that the data are generated from the empirical distribution of the data. This ignores the additional uncertainty that enters the procedure from not knowing how well the data actually represent the true underlying distribution. We pretend that they do, but (particularly with small samples) this representation may not be very good. Bootstrap CIs can be systematically too short because a CI is meant to capture the uncertainty in the data about the parameter estimator, whereas the bootstrap CI only accounts for the variation visible in the data, not for the potential additional discrepancy between the empirical distribution of the data and the underlying true data-generating process. (In many situations this discrepancy vanishes as $n\to\infty$.)
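
One way to see this numerically (a sketch under my own assumptions: Exp(1) data, the sample mean, $n = 10$): the bootstrap standard error of the mean is essentially the plug-in value $\hat\sigma_n/\sqrt{n}$. Across repeated small samples it is on average somewhat too small and, more importantly, it fluctuates strongly around the true standard error $\sigma/\sqrt{n}$, because the empirical distribution of a small sample can represent the truth badly. That fluctuation is exactly the variation the bootstrap interval does not account for.

```python
# Compare bootstrap standard errors of the mean with the true standard error
# across many small samples from Exp(1).
import numpy as np

rng = np.random.default_rng(3)
n, n_datasets, n_boot = 10, 2000, 2000
true_se = 1.0 / np.sqrt(n)                       # Exp(1): sigma = 1, SE = 1/sqrt(n)

boot_ses = []
for _ in range(n_datasets):
    x = rng.exponential(size=n)
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_ses.append(x[idx].mean(axis=1).std())   # bootstrap SE for this sample

boot_ses = np.array(boot_ses)
print("true SE of the mean  :", true_se)
print("average bootstrap SE :", boot_ses.mean())  # typically below the true SE
print("sd of bootstrap SEs  :", boot_ses.std())   # sample-to-sample fluctuation
```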

(If I understand your points 1 and 2 correctly, it is neither of these.)