Solved – Bootstrap confidence intervals – how many replications to choose

bootstrap, confidence interval

I applied a bootstrap procedure to calculate confidence intervals for the parameters of a multiple linear regression.

In R it's pretty simple to implement (functions: 'boot' and 'boot.ci'), but there are still two things I don't understand:

1. Why does it make sense to perform a bootstrap procedure before calculating the confidence intervals? Will they be more precise? And if so, can anyone explain why?
2. How can I decide which number of replications is a good number for calculating confidence intervals? 100? 1000? 10000?

I would really appreciate any help!

Best Answer

Why does it make sense to perform a bootstrap procedure before calculating the confidence intervals? Will they be more precise? And if so, can anyone explain why?

You can calculate bootstrap confidence intervals for complex situations, i.e. for properties ("statistics") that are not easily accessible analytically. I'm thinking of things like bootstrapping the generalization error of a predictive model.

In other words, bootstrapping may still be possible in situations where you have no good basis for assuming a particular distribution on which to base your confidence intervals.

The choice between parametric intervals (analytical confidence intervals based on a known distribution) and the non-parametric bootstrap is a trade-off:

Good parametric statistics will be more precise. But they may be totally off if the assumptions are violated (i.e. the distribution you chose was not appropriate).
The bootstrap is less precise (for a given number of original cases) but does not rely on particular distributional assumptions, so there's less danger of getting that part wrong.
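
To make the trade-off concrete, here is a minimal sketch that computes both kinds of interval for one regression coefficient. The data set (`mtcars`), the model `mpg ~ wt + hp`, the helper name `coef_stat`, and `R = 2000` are assumptions chosen for illustration, not the asker's actual setup.

```r
library(boot)  # recommended package shipped with standard R installations

set.seed(1)

# Statistic for boot(): refit the model on the resampled rows (cases)
# and return the coefficient vector.
coef_stat <- function(data, idx) {
  coef(lm(mpg ~ wt + hp, data = data[idx, ]))
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
b   <- boot(mtcars, statistic = coef_stat, R = 2000)

# Parametric (t-based) 95% interval for the 'wt' coefficient
ci_param <- confint(fit, "wt", level = 0.95)

# Nonparametric percentile bootstrap interval for the same coefficient;
# index = 2 selects the second element of the statistic, i.e. 'wt'.
ci_boot <- boot.ci(b, conf = 0.95, type = "perc", index = 2)

print(ci_param)
print(ci_boot$percent[4:5])  # lower and upper percentile limits
```

If the two intervals differ markedly, that is a hint that the parametric assumptions deserve a closer look; if they roughly agree, the parametric interval is usually the tighter, cheaper choice.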

How can I decide which number of replications is a good number for calculating confidence intervals? 100? 1000? 10000?

@MartenBuuis already gave you some idea of how to approach this question. Here's another, very pragmatic one:

1. Bootstrap with, say, nboot = 100 replications.
2. Repeat this 10 times.
3. Check the variability of the bootstrap results across the 10 repetitions.
4. If the variation you observe over the repetitions of the bootstrapping calculation is acceptable for your application, pool the 10 x 100 calculations and use the result of those nboot = 10 x 100 = 1000 replications.
5. If the results are not sufficiently precise, pool the 10 x 100 calculations and go back to steps 1 and 2 with nboot = 1000 replications.

You get the idea.
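
The procedure above could be sketched in base R roughly as follows. The helper `boot_batch`, the `mtcars` data, the model, and the use of percentile limits as the "bootstrap result" being checked are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical helper: one batch of nboot case-resampling replicates of the
# 'wt' slope from lm(mpg ~ wt + hp) on the built-in mtcars data.
boot_batch <- function(nboot, data = mtcars) {
  replicate(nboot, {
    idx <- sample(nrow(data), replace = TRUE)
    coef(lm(mpg ~ wt + hp, data = data[idx, ]))[["wt"]]
  })
}

n_rep <- 10   # number of repeated bootstrap runs
nboot <- 100  # replications per run

# Run the bootstrap n_rep times; result is an nboot x n_rep matrix
batches <- replicate(n_rep, boot_batch(nboot))

# Percentile limits per run, and their spread across the runs
limits <- apply(batches, 2, quantile, probs = c(0.025, 0.975))
spread <- apply(limits, 1, sd)
print(limits)
print(spread)

# If the spread is acceptable, pool all n_rep * nboot replicates
pooled    <- as.vector(batches)
ci_pooled <- quantile(pooled, probs = c(0.025, 0.975))
print(ci_pooled)
```

What counts as an acceptable `spread` depends on the application; if it is too large, the same code can be rerun with `nboot <- 1000` and the earlier replicates kept in the pool.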

Related Solutions

Confidence Intervals – Using Monte Carlo and Nonparametric Methods for Confidence Intervals of Mean Estimate

I think you are proceeding incorrectly. You should be constructing:

d <- c(x1, x2, x3)

And then examining the statistics of interest before applying them to the samples.

Solved – Confidence Intervals Around a Mean: biased (non-centered) confidence interval? (an exercise using R)

First of all, I agree with the comments left by heropup. I'll add some details.

The reason why your simulation breaks down may be a little subtle; at least I spent some time reading your code to find the source of the problem. Please note that you simulate only once for each of the cases. Your CI functions then resample this initial data set. This creates a lot of dependence between the samples. For instance, if you draw a sample of 1000 from the original data set, there is only one way to do this. If you draw a sample of 999, an overwhelming majority of the data will still be the same between resamples. You need to draw independent samples instead. Otherwise, the 100 samples are essentially the same when you let $n$ get large.
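
The dependence can be made visible with a small sketch. It assumes, as the description above suggests, that the subsamples are drawn without replacement from one fixed data set of 1000 standard normal values; the variable names are illustrative and not taken from the original code.

```r
set.seed(7)

N <- 1000
original <- rnorm(N)  # the single simulated data set

# 100 subsamples of size 999 drawn WITHOUT replacement from that one data set:
# each differs from the others by at most one observation.
dependent_means <- replicate(100, mean(sample(original, N - 1)))

# 100 independently simulated data sets of the same size
independent_means <- replicate(100, mean(rnorm(N - 1)))

sd_dependent   <- sd(dependent_means)
sd_independent <- sd(independent_means)
print(c(dependent = sd_dependent, independent = sd_independent))
```

The means of the dependent subsamples barely vary, so intervals built from them say almost nothing about sampling variability, whereas the independently simulated means vary on the scale expected for a sample of that size.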

Turning to your question: confidence intervals like the ones you compute above are based on a distributional assumption, for instance that your observations are normally distributed. If that assumption holds, the confidence interval will be 'centered' in the sense you describe, provided you construct it symmetrically. This follows from the symmetry of the distribution and of the construction procedure. In the above, you also calculate a confidence interval when the distributional assumption is not correct. In that case a confidence interval need not be centered, even if the assumed distribution is symmetric. This can be seen by simulating observations from a chi-squared distribution and calculating confidence intervals based on a normal distribution. However, by the central limit theorem, the mean of the chi-squared observations will be approximately normally distributed for large enough sample sizes.
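
A simulation along these lines might look like the sketch below. The helper `miss_rates`, the choice of one degree of freedom, and the sample sizes are assumptions for illustration; each replicate draws a fresh, independent sample and records on which side the normal-theory interval misses the true mean (which equals the degrees of freedom for a chi-squared distribution).

```r
set.seed(123)

# Hypothetical helper: simulate n_sim independent chi-squared samples of size n,
# build a normal-theory (t) interval for the mean from each, and return the
# fraction of intervals lying entirely below or entirely above the true mean.
miss_rates <- function(n, df = 1, n_sim = 5000, level = 0.95) {
  misses <- replicate(n_sim, {
    x  <- rchisq(n, df = df)
    ci <- t.test(x, conf.level = level)$conf.int
    c(too_low = ci[2] < df, too_high = ci[1] > df)
  })
  rowMeans(misses)
}

small_n <- miss_rates(n = 10)
large_n <- miss_rates(n = 1000, n_sim = 2000)
print(rbind(small_n, large_n))
```

For small samples the misses are lopsided: the interval falls below the true mean far more often than above it. With a large sample the two miss rates move closer together, in line with the central limit theorem argument.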

Finally, I just want to note that a confidence interval (or more generally, a confidence set) is, loosely speaking, just some subset of the parameter set such that when you (hypothetically) calculate this set many times, it will contain the true parameter value, say, 95% of the time. There's no claim of this set being 'centered' or symmetric around a parameter estimate. It can be chosen to have all sorts of strange forms. This is just not very intuitive and most of the time not very helpful.
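
Stated a little more formally (the notation $C(X)$, $\theta$ and $\alpha$ is introduced here for illustration and does not appear in the question), a $(1-\alpha)$ confidence set only has to satisfy a coverage requirement:

$$P_\theta\bigl(\theta \in C(X)\bigr) \ge 1 - \alpha \quad \text{for all } \theta.$$

For a normal mean with known $\sigma$, both the symmetric interval $\bar{X} \pm z_{0.975}\,\sigma/\sqrt{n}$ and the one-sided set $(-\infty,\ \bar{X} + z_{0.95}\,\sigma/\sqrt{n}]$ meet this requirement at the 95% level, although only the first is centered on the estimate.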
