Bootstrap – Understanding the Issue of Overfitting in Bootstrap Samples

bootstrapfinite-populationsamplesample-sizesmall-sample

Suppose one performs the so-called non-parametric bootstrap by drawing $B$ samples of size $n$ each from the original $n$ observations with replacement. I believe this procedure is equivalent to estimating the cumulative distribution function by the empirical cdf:

http://en.wikipedia.org/wiki/Empirical_distribution_function

and then obtaining the bootstrap samples by simulating $n$ observations from the estimated cdf $B$ times in a row.

If I am right in this, then one has to address the issue of overfitting, because the empirical cdf has about N parameters. Of course, asymptotically it converges to the population cdf, but what about finite samples? E.g. if I were to tell you that I have 100 observations and I am going to estimate the cdf as $N(\mu, \sigma^2)$ with two parameters, you wouldn't be alarmed. However, if the number of parameters were to go up to 100, it wouldn't seem reasonable at all.

Likewise, when one employs a standard multiple linear regression, the distribution of the error term is estimated as $N(0, \sigma^2)$. If one decides to switch to bootstrapping the residuals, he has to realize that now there are about $n$ parameters used just to handle the error term distribution.

Could you please direct me to some sources that address this issue explicitly, or tell me why it's not an issue if you think I got it wrong.

Best Answer

i am not completely sure i understand your question right... i am assuming you are interested in the order of convergence?

because the empirical cdf has about N parameters. Of course, asymptotically it converges to the population cdf, but what about finite samples?

Have you read any of the basics on bootstrap theory? The Problem is that it gets pretty wild (mathematically) pretty quickly.

Anyway, i recommend having a look at

van der Vaart "Asymptotic Statistics" chapter 23.

Hall "Bootstrap and Edgeworth expansions" (lengthy but concise and less handwaving than van der Vaart i'd say)

for the basics.

Chernick "Bootstrap Methods" is more aimed at users rather than mathematicians but has a section on "where bootstrap fails".

The classical Efron/Tibshirani has little on why bootstrap actually works...

Related Solutions

Bootstrap – Questions on Parametric and Non-Parametric Bootstrap in Frequentist Theory

The answer given by miura is not entirely accurate so I am answering this old question for posterity:

(2). These are very different things. The empirical cdf is an estimate of the CDF (distribution) which generated the data. Precisely, it is the discrete CDF which assigns probability $1/n$ to each observed data point, $\hat{F}(x) = \frac{1}{n}\sum_{i=1}^n I(X_i\leq x)$, for each $x$. This estimator converges to the true cdf: $\hat{F}(x) \to F(x) = P(X_i\leq x)$ almost surely for each $x$ (in fact uniformly).

The sampling distribution of a statistic $T$ is instead the distribution of the statistic you would expect to see under repeated experimentation. That is, you perform your experiment once and collect data ${X_1,\ldots,X_n}$. $T$ is a function of your data: $T = T(X_1,\ldots,X_n)$. Now, suppose you repeat the experiment, and collect data ${X'_1,\ldots,X'_n}$. Recalculating T on the new sample gives $T' = T({X'_1,\ldots,X'_n})$. If we collected 100 samples we would have 100 estimates of $T$. These observations of $T$ form the sampling distribution of $T$. It is a true distribution. As the number of experiments goes to infinity its mean converges to $E(T)$ and its variance to $Var(T)$.

In general of course we don't repeat experiments like this, we only ever see one instance of $T$. Figuring out what the variance of $T$ is from a single observation is very difficult if you don't know the underlying probability function of $T$ a priori. Bootstrapping is a way to estimate that sampling distribution of $T$ by artificially running "new experiments" on which to calculate new instances of $T$. Each new sample is actually just a resample from the original data. That this provides you with more information than you have in the original data is mysterious and totally awesome.

(1). You are correct--you would not do this. The author is trying to motivate the parametric bootstrap by describing it as doing "what you would do if you knew the distribution" but substituting a very good estimator of the distribution function--the empirical cdf.

For example, suppose you know that your test statistic $T$ is normally distributed with mean zero, variance one. How would you estimate the sampling distribution of $T$? Well, since you know the distribution, a silly and redundant way to estimate the sampling distribution is to use R to generate 10,000 or so standard normal random variables, then take their sample mean and variance, and use these as our estimates of the mean and variance of the sampling distribution of $T$.

If we don't know a priori the parameters of $T$, but we do know that it's normally distributed, what we can do instead is generate 10,000 or so samples from the empirical cdf, calculate $T$ on each of them, then take the sample mean and variance of these 10,000 $T$s, and use them as our estimates of the expected value and variance of $T$. Since the empirical cdf is a good estimator of the true cdf, the sample parameters should converge to the true parameters. This is the parametric bootstrap: you posit a model on the statistic you want to estimate. The model is indexed by a parameter, e.g. $(\mu, \sigma)$, which you estimate from repeated sampling from the ecdf.

(3). The nonparametric bootstrap doesn't even require you to know a priori that $T$ is normally distributed. Instead, you simply draw repeated samples from the ecdf, and calculate $T$ on each one. After you've drawn 10,000 or so samples and calculated 10,000 $T$s, you can plot a histogram of your estimates. This is a visualization of the sampling distribution of $T$. The nonparametric bootstrap won't tell you that the sampling distribution is normal, or gamma, or so on, but it allows you to estimate the sampling distribution (usually) as precisely as needed. It makes fewer assumptions and provides less information than the parametric bootstrap. It is less precise when the parametric assumption is true but more accurate when it is false. Which one you use in each situation you encounter depends entirely on context. Admittedly more people are familiar with the nonparametric bootstrap but frequently a weak parametric assumption makes a completely intractable model amenable to estimation, which is lovely.

Bootstrap Methodology – How It Addresses Small Sample Size Issues

I remember reading that using the percentile confidence interval for bootstrapping is equivalent to using a Z interval instead of a T interval and using $n$ instead of $n-1$ for the denominator. Unfortunately I don't remember where I read this and could not find a reference in my quick searches. These differences don't matter much when n is large (and the advantages of the bootstrap outweigh these minor problems when $n$ is large), but with small $n$ this can cause problems. Here is some R code to simulate and compare:

simfun <- function(n=5) {
    x <- rnorm(n)
    m.x <- mean(x)
    s.x <- sd(x)
    z <- m.x/(1/sqrt(n))
    t <- m.x/(s.x/sqrt(n))
    b <- replicate(10000, mean(sample(x, replace=TRUE)))
    c( t=abs(t) > qt(0.975,n-1), z=abs(z) > qnorm(0.975),
        z2 = abs(t) > qnorm(0.975), 
        b= (0 < quantile(b, 0.025)) | (0 > quantile(b, 0.975))
     )
}

out <- replicate(10000, simfun())
rowMeans(out)

My results for one run are:

     t      z     z2 b.2.5% 
0.0486 0.0493 0.1199 0.1631

So we can see that using the t-test and the z-test (with the true population standard deviation) both give a type I error rate that is essentially $\alpha$ as designed. The improper z test (dividing by sample standard deviation, but using Z critical value instead of T) rejects the null more than twice as often as it should. Now to the bootstrap, it is rejecting the null 3 times as often as it should (looking if 0, the true mean, is in the interval or not), so for this small sample size the simple bootstrap is not sized properly and therefore does not fix problems (and this is when the data is optimally normal). The improved bootstrap intervals (BCa etc.) will probably do better, but this should raise some concern about using bootstrapping as a panacea for small sample sizes.

Best Answer

Related Solutions

Bootstrap – Questions on Parametric and Non-Parametric Bootstrap in Frequentist Theory

Bootstrap Methodology – How It Addresses Small Sample Size Issues

Related Question