Bootstrapping and Central Limit Theorem – Probability and Statistics

Tags: pr.probability, probability-distributions, st.statistics

I have been looking into bootstrapping lately, and although I believe I have understood the basic process, I am fuzzy on the mathematical details. I will begin with my understanding of what bootstrapping is and then my understanding of the mathematics going on in the background. I might well be mistaken on either one.

Bootstrapping (as I understand it):

The idea behind (most) bootstrapping is to study a random variable $X$ and its (unknown) distribution by resampling a sample $x=(x_1, …, x_n)$ of $X$. That is, we draw $n$ entries of $x$ uniformly at random with replacement to create a new sample $x^{(1)}=(x_1^{(1)}, …, x_n^{(1)})$, and we repeat this to create a collection of samples $(x^{(i)})_i$.
For each of these samples, we can now compute, for example, the mean and plot these means in a histogram. This histogram can be normalized to give us a distribution of the mean across all new samples. The same process could of course be applied to any other statistic we might want to compute for any of the samples, giving us a histogram each time.
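A minimal sketch of this resampling loop in Python with NumPy (the sample data and the number of resamples below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an observed sample x = (x_1, ..., x_n); values are made up.
x = rng.exponential(scale=2.0, size=50)
n = len(x)

# Draw B resamples of size n, uniformly with replacement from x,
# and record the mean of each resample.
B = 10_000
boot_means = np.array([rng.choice(x, size=n, replace=True).mean()
                       for _ in range(B)])

# A normalized histogram of boot_means approximates the distribution
# of the mean across the resamples.
density, edges = np.histogram(boot_means, bins=40, density=True)
```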

Mathematical model (as I understand it):

Let $X$ be a random variable with unknown distribution $f_X$. We have a sample $x=(x_1, …, x_n)$ of $X$. This sample defines a (discrete) empirical distribution $f_x$. We can now define $n$ independent random variables $Y_1, …, Y_n \sim f_x$. Computing the mean of a resample of $x$ is then equivalent to sampling the random variable $X' = \frac{1}{n}(Y_1 + … + Y_n)$. Sampling $X'$ many times, producing a histogram, and normalizing it would give something tending towards the distribution of $\bar X = \frac{1}{m}(X'_1 + … + X'_m)$ as $m$ tends to infinity, where $X'_1, …, X'_m$ are independent and distributed as $X'$. The CLT now tells us that this would be a Gaussian distribution.

Barring any mistakes I have made here, I have two questions:

  1. Am I correct here? Does the histogram of the mean of bootstrapped samples really resemble a Gaussian distribution? (In the sense that we can normalize the histogram and take the limit.)
  2. What if we replace the mean by some other function $g$ of $Y_1, …, Y_n$, producing a different $X'$, e.g. $g=\min$? Do we still get something Gaussian? My instinct says no, but I cannot see why the CLT would not work the same. (See the simulation sketch after this list.)
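For comparison, a sketch of the same resampling loop with $g=\min$ (same made-up data conventions as above); the resulting histogram is concentrated on a few of the smallest observed values rather than bell-shaped:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=50)  # made-up observed sample
n, B = len(x), 10_000

# Bootstrap distribution of g = min: each resampled minimum can only be
# one of the values in x, and it piles up on the smallest few of them.
boot_mins = np.array([rng.choice(x, size=n, replace=True).min()
                      for _ in range(B)])

values, counts = np.unique(boot_mins, return_counts=True)
# A handful of distinct values carry almost all the mass (the smallest
# observation alone has probability 1 - (1 - 1/n)^n, about 0.63),
# so no normalization makes this histogram look Gaussian.
```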

Best Answer

$\newcommand{\X}{\mathbf X}\newcommand{\x}{\mathbf x}\newcommand{\de}{\delta}\newcommand{\R}{\mathbb R}$Your understanding of the purposes of the bootstrap is largely incorrect.

Here is what the foundational paper by Efron (Bootstrap methods: another look at the jackknife, Ann. Statist. 7 (1979), 1--26), which introduced the bootstrap method, says:

We discuss the following problem: given a random sample $\X=(X_1,X_2,\dots,X_n)$ from an unknown probability distribution $F$, estimate the sampling distribution of some prespecified random variable $R(\X,F)$, on the basis of the observed data $\x$.

We see that here the (say real-valued) random variable (r.v.) \begin{equation*} Y:=R(\X,F) \end{equation*} in question is a function not just of the sample $\X$, but also of the unknown probability distribution $F$ (of each $X_i$).

To estimate the distribution of the r.v. $Y=R(\X,F)$ by the bootstrap method, we obtain (usually by computer simulation) a large number, say $B$, of (desirably/approximately) independent random samples \begin{equation*} \x^*_1=(x^*_{1,1},\dots,x^*_{1,n}),\dots,\x^*_B=(x^*_{B,1},\dots,x^*_{B,n}) \end{equation*} from the empirical distribution \begin{equation*} F_\x=\frac1n\sum_{i=1}^n\de_{x_i} \end{equation*} corresponding to the observed sample $\x=(x_1,\dots,x_n)$, where $\de_x$ is the Dirac delta measure supported on the singleton set $\{x\}$. So, the $Bn$ r.v.'s $x^*_{j,i}$ with $j=1,\dots,B$ and $i=1,\dots,n$ are (desirably/approximately) iid, each with the distribution $F_\x$. Here $n$ is large enough that the empirical distribution $F_\x$ is close enough to the true but unknown distribution $F$, and $B$ is very large -- which is affordable given the computing power available nowadays.
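In code, drawing from $F_\x$ amounts to drawing the observed values with replacement; a minimal sketch (the sample `x` below is a made-up stand-in for the observed data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=3.0, size=100)  # stand-in for the observed sample
n, B = len(x), 5_000

# Row j of x_star is the bootstrap sample x*_j = (x*_{j,1}, ..., x*_{j,n}):
# n (pseudo-)independent draws from the empirical distribution F_x,
# i.e. draws from the entries of x with replacement.
x_star = rng.choice(x, size=(B, n), replace=True)
```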

Then, for each $j=1,\dots,B$, in the formula $Y=R(\X,F)$ we replace the r.v. $\X$ and the unknown distribution $F$ by the known $\x^*_j$ and $F_\x$, respectively, to get
\begin{equation*} y^*_1:=R(\x^*_1,F_\x),\dots,y^*_B:=R(\x^*_B,F_\x). \end{equation*} Because the empirical distribution $F_\x$ is close enough to the true but unknown distribution $F$, the empirical distribution \begin{equation*} \frac1B\sum_{j=1}^B\de_{y^*_j} \tag{1}\label{1} \end{equation*} will be somewhat close to the desired unknown distribution of $Y=R(\X,F)$ -- if the function $R$ is continuous in an appropriate sense. The empirical distribution \eqref{1} is called the bootstrap distribution of $Y=R(\X,F)$.
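Continuing the sketch above, and taking as an illustration the centered mean $R(\X,F)=\bar X-\int_\R x\,F(dx)$ from the next paragraph (so that replacing $F$ by $F_\x$ replaces the integral by $\bar x$):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=3.0, size=100)          # stand-in observed sample
x_star = rng.choice(x, size=(5_000, len(x)), replace=True)

# Replace X by x*_j and F by F_x in R.  For the centered mean,
# R(x*_j, F_x) = mean(x*_j) - mean(F_x), with mean(F_x) = x.mean().
y_star = x_star.mean(axis=1) - x.mean()

# The empirical distribution (1) of the y*_j is the bootstrap
# distribution of Y = R(X, F); e.g. its CDF evaluated at a point t:
def bootstrap_cdf(t):
    return np.mean(y_star <= t)
```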

The simplest example of the function $R$ in Efron's paper is given by \begin{equation*} R(\X,F)=\bar X-\int_\R x\,F(dx), \end{equation*} where $\bar X:=\frac1n\sum_{i=1}^n X_i$ and $F$ is the Bernoulli distribution with an unknown parameter. As noted by Efron, clearly in this case one does not need any simulation to find/estimate the mean and the variance of the distribution of $R(\x^*_1,F_\x)$ -- they are obviously $0$ and $\bar x(1-\bar x)/n$, respectively, where, of course, $\bar x:=\frac1n\sum_{i=1}^n x_i$.
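A quick simulation check of this claim (the parameter 0.3 below is made up and used only to generate the data, playing the role of the "unknown" parameter):

```python
import numpy as np

rng = np.random.default_rng(3)
n, B = 200, 20_000
x = rng.binomial(1, 0.3, size=n)   # Bernoulli sample; parameter treated as unknown
xbar = x.mean()

# Bootstrap replicates of R(X, F) = mean(X) - mean(F), with mean(F_x) = xbar.
y_star = rng.choice(x, size=(B, n), replace=True).mean(axis=1) - xbar

print(y_star.mean())                         # close to 0
print(y_star.var(), xbar * (1 - xbar) / n)   # the two values nearly agree
```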


Of course, neither the true distribution of $Y=R(\X,F)$ nor its bootstrap distribution will be even approximately normal in general. However, recall that the bootstrap distribution is random (or, at least, pseudo-random), even for a given realization $\x$ of the random sample $\X$, because the bootstrap distribution depends on the (pseudo-)random bootstrap samples $\x^*_1,\dots,\x^*_B$. Therefore, for each realization $\x$ of $\X$, the bootstrap distribution of $R(\X,F)$ (that is, the empirical distribution \eqref{1} based on $y^*_1,\dots,y^*_B$) will satisfy a central limit theorem for empirical measures -- see, e.g., Dudley (Central limit theorems for empirical measures, Ann. Probab. 6 (1978), 899--929) and subsequent papers.
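To see this randomness concretely: running the bootstrap twice on the same observed $\x$ with different simulation seeds gives two slightly different bootstrap distributions, with a Monte Carlo discrepancy that shrinks as $B$ grows. A sketch (data and function names are mine, for illustration only):

```python
import numpy as np

x = np.random.default_rng(4).normal(size=100)  # one fixed realization of the sample

def bootstrap_quantile(seed, B=2_000, q=0.95):
    """Bootstrap estimate of the q-quantile of the centered sample mean."""
    rng = np.random.default_rng(seed)
    y = rng.choice(x, size=(B, len(x)), replace=True).mean(axis=1) - x.mean()
    return np.quantile(y, q)

# Same x, different bootstrap randomness: the two estimates differ slightly,
# and the difference shrinks for larger B.
print(bootstrap_quantile(seed=0), bootstrap_quantile(seed=1))
```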
