I don't follow your code: you call your data different things in different places, I don't see your function being used anywhere, etc. Setting that aside, I'm not sure there is a big problem with your model other than the fact that you don't have much data (I gather N = 17, which is quite small). I don't think your standard errors would be that problematic if you had a more typical sample size.
Moreover, your model seems impressively good to me for a logistic regression with so few data to work with. The reason neither variable is significant is clearly that they are correlated. Correlation between predictors inflates the variances of your estimates (and hence your SEs), but that wouldn't be so bad if you had more data. As it is, the variances of your estimates are about one third larger than they would have been if your predictors were perfectly uncorrelated:
1/(1 - 0.49^2)   # variance inflation factor, 1/(1 - r^2), with r = 0.49
# [1] 1.315963
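If you want to see this effect directly, here is a small simulation sketch (entirely made-up data, not yours) comparing the SE of a logistic coefficient when its companion predictor is uncorrelated versus correlated at r = 0.49:

set.seed(1)
n <- 1000
r <- 0.49
# uncorrelated predictors
x1 <- rnorm(n);  x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.5*x1 + 0.5*x2))
se.uncor <- summary(glm(y ~ x1 + x2, family = binomial))$coefficients["x1", "Std. Error"]
# predictors correlated at r = 0.49
x1 <- rnorm(n);  x2 <- r*x1 + sqrt(1 - r^2)*rnorm(n)
y  <- rbinom(n, 1, plogis(0.5*x1 + 0.5*x2))
se.cor <- summary(glm(y ~ x1 + x2, family = binomial))$coefficients["x1", "Std. Error"]
(se.cor/se.uncor)^2   # variance ratio; close to 1/(1 - r^2) on average

Any single run will be noisy, but averaged over many replications the variance ratio settles near the 1.32 factor above.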
That means the model doesn't know which of the two variables should be given credit for predicting the response. Nonetheless, there is good predictive ability amongst those variables somewhere, as can be seen by their combined significance:
1 - pchisq(23.508 - 14.893, df = 2)   # LR test: null deviance - residual deviance, df = 2
# [1] 0.01346718
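For reference, you can pull the same test straight from a fitted model object rather than typing the deviances in by hand; the data and the fit here are hypothetical stand-ins for your own:

# hypothetical example; substitute your own model for fit
set.seed(2)
d   <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- rbinom(50, 1, plogis(d$x1))
fit <- glm(y ~ x1 + x2, family = binomial, data = d)
with(fit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))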
As far as bootstrapping goes, it is used to get an estimate of the nature of the sampling distribution that doesn't rely on assumptions about normality. It may help you to read this excellent CV thread: Explaining to laypeople why bootstrapping works.
The objective of bootstrapping is (usually) to get some idea of the distribution of the parameter estimate(s). Since the parameter estimates were formed on the basis of a sample of size $N$, their distribution is conditional upon that sample size. Resampling to larger or smaller sample sizes will, consequently, give a more distorted view of the distribution of the parameter estimates than resampling with a sample size of $N$.
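As a concrete illustration of resampling at size $N$ (fabricated data, nothing to do with your problem), the basic nonparametric bootstrap of a mean looks like:

set.seed(3)
x <- rexp(17)        # a small sample, N = 17
N <- length(x)
boot.means <- replicate(2000, mean(sample(x, N, replace = TRUE)))
sd(boot.means)       # bootstrap SE of the mean
sd(x)/sqrt(N)        # usual analytic SE, for comparison

Note that every resample has exactly $N$ observations; that is what makes the bootstrap distribution mimic the sampling distribution at your actual sample size.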
In this case, however, you are not actually performing the Efron bootstrap. You are simply generating simulated values of the sample path based upon the 500 estimated errors. Consequently, the question of whether you can generate more than 500 such sample paths is moot; you can, as Johan points out, generate as many as you want.
Since you are basing all your results on the one set of initial parameter estimates, the sample paths are conditional upon that set being correct. The variability in the end result does not take into account parameter uncertainty, and it is this additional variability that the Efron bootstrap is designed to help with. A process that incorporates the bootstrap might be:
1. Select a sample (with replacement) of 500 values from the initial set of standardized residuals (this 500 is the "500" that gave you so much trouble in your thinking about the problem, and that Efron refers to in the book).
2. Calculate a simulated version of the original series using those standardized residuals and your initial parameter estimates.
3. Re-estimate the parameters using the simulated version of the original series.
4. Use the standardized residuals from the re-estimated parameters and the original data to generate some (smallish) number $M$ of future sample paths.
5. If you've generated enough sample paths overall, exit; otherwise go to step 1.
Steps 1 through 3 are where the Efron bootstrap comes into play. Step 4 is the simulation as it is currently performed. Note that at each iteration you are generating new standardized residuals for use in the simulator; this will lessen the dependence of the results on the initial set of parameter estimates / standardized residuals and take into account, to some extent, the inaccuracy in the parameter estimates themselves.
If you generate $K$ bootstrap estimates in steps 1 and 2, you will have generated $KM$ total sample paths at the end of the exercise. How you should divide those between $K$ and $M$ depends to some extent on the various computational burdens involved but also upon how the contributions to randomness are split between parameter estimation error and sample path variability. As a general rule, the more accurate your parameter estimates are, the smaller $K$ can be; conversely, the less the sample paths vary for a given value of the parameter estimates, the smaller $M$ can be.
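To make the loop concrete, here is a minimal sketch in R. An AR(1) fitted with arima() stands in for whatever model you actually estimated, and the data, the horizon h, and the K and M settings are all illustrative assumptions, not part of your problem:

set.seed(4)
y <- arima.sim(list(ar = 0.6), n = 500)            # stand-in for your data
fit0 <- arima(y, order = c(1, 0, 0))               # initial parameter estimates
res0 <- as.numeric(scale(residuals(fit0)))         # initial standardized residuals
K <- 50; M <- 20; h <- 10                          # bootstrap reps, paths per rep, horizon
paths <- matrix(NA_real_, K*M, h)
for (k in 1:K) {
  e.star <- sample(res0, length(y), replace = TRUE) * sqrt(fit0$sigma2)   # step 1
  y.star <- arima.sim(list(ar = coef(fit0)["ar1"]), n = length(y),
                      innov = e.star)                                     # step 2
  fit.k <- arima(y.star, order = c(1, 0, 0))                              # step 3
  res.k <- as.numeric(scale(residuals(fit.k)))
  mu <- coef(fit.k)["intercept"]; phi <- coef(fit.k)["ar1"]
  for (m in 1:M) {                                                        # step 4
    e.sim <- sample(res.k, h, replace = TRUE) * sqrt(fit.k$sigma2)
    path <- numeric(h); prev <- as.numeric(tail(y, 1))
    for (t in 1:h) {                                # simulate the path forward
      path[t] <- mu + phi*(prev - mu) + e.sim[t]
      prev <- path[t]
    }
    paths[(k - 1)*M + m, ] <- path
  }
}                                                   # step 5: loop until K*M paths exist

The row index (k - 1)*M + m just stacks all $KM$ paths into a single matrix; summaries across its rows then reflect both parameter-estimation error (via the $K$ outer draws) and sample-path variability (via the $M$ inner draws).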
Best Answer
Let $\bar x = \frac1n \sum_{i=1}^n x_i$ where $x_i$ is the indicator variable of whether student $i$ attends a 4-year university. Your data give $\bar x = 0.629$ and $n=500$.
Let $\mathbb P_n = \frac1n \sum_{i=1}^n \delta_{x_i}$ be the empirical distribution of your observations. This simplifies to $$ \mathbb P_n = \bar x \delta_1 + (1-\bar x) \delta_0 $$ which is a Bernoulli distribution with parameter $\bar x$.
Let $x^*_i \sim \mathbb P_n$ be iid draws from this empirical distribution for $i=1,\dots,n$; $(x^*_1,\dots,x^*_n)$ is the bootstrap sample. Your bootstrap estimate of the parameter of interest is $\frac1n \sum_{i=1}^n x^*_i$, which has mean $\bar x$ under the empirical distribution. The variance of this estimate (which will be your bootstrap estimate of the variance of the original estimator) is $$ \mathbb P_n \Big(\frac1n \sum_{i=1}^n x^*_i - \bar x\Big)^2 = \frac1n \mathbb P_n (x^*_1 -\bar x)^2 = \frac{\bar x(1-\bar x)}{n} $$ where $\mathbb P_n$ above means the expectation under the empirical measure $\mathbb P_n$. (If this is too much notation, just mentally replace it with $\mathbb E$.) The first equality holds by the iid nature of $\{x^*_i\}$; the second is the usual formula for the variance of a Bernoulli variable.
Your bootstrap estimate of the standard error is then $$ \sqrt{\frac{\bar x(1-\bar x)}{n}} = \sqrt{\frac{0.629(1-0.629)}{500}} \approx 0.0216 $$
EDIT: Also, as far as I understand, this is the exact standard error (or standard deviation if you will) of the bootstrap distribution of the sample mean. There is no need to approximate it, since it is obtainable in closed form in this case.
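If you want to check this numerically anyway, a quick Monte Carlo run of the bootstrap (my own sketch, with a fabricated sample whose mean is 0.63) lands essentially on the closed-form value:

set.seed(5)
n <- 500
x <- c(rep(1, 315), rep(0, 185))    # hypothetical sample, mean(x) = 0.63
boot.means <- replicate(10000, mean(sample(x, n, replace = TRUE)))
sd(boot.means)                      # Monte Carlo bootstrap SE
sqrt(mean(x)*(1 - mean(x))/n)       # closed-form value, about 0.0216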