Bootstrap for Regression – Two Ways to Estimate Confidence Intervals of Coefficients


I am applying a linear model to my data:
$$
y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}, \quad\epsilon_{i} \sim N(0,\sigma^{2}).
$$

I would like to estimate the confidence interval (CI) of the coefficients ($\beta_{0}$, $\beta_{1}$) using the bootstrap. There are two ways I can apply the bootstrap (a short R sketch of both schemes follows the list):

  1. Sample paired response-predictor: Randomly resample pairs $(x_{i}, y_{i})$ with replacement, and fit the linear regression to each resample. After $m$ runs, we obtain a collection of estimated coefficients $\{\hat{\beta}^{(j)}\}, j=1,\dots,m$. Finally, compute the quantiles of $\{\hat{\beta}^{(j)}\}$.

  2. Sample error: First fit the linear regression to the original observed data; from this fit we obtain the estimates $\hat{\beta}_{0}, \hat{\beta}_{1}$ and the residuals $\hat{\epsilon}_{i}$. Then randomly resample the residuals with replacement to get $\epsilon^{*}_{i}$, and construct new responses $y^{*}_{i}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{i}+\epsilon^{*}_{i}$, keeping the $x_{i}$ fixed. Fit the linear regression once again. After $m$ runs, we obtain a collection of estimated coefficients $\{\hat{\beta}^{(j)}\}, j=1,\dots,m$. Finally, compute the quantiles of $\{\hat{\beta}^{(j)}\}$.
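For concreteness, here is a minimal R sketch of both schemes on simulated data (the simulated model, sample size, and number of replications are my own choices for illustration):

    set.seed(1)
    n <- 50
    x <- runif(n)
    y <- 1 + 2 * x + rnorm(n, sd = 0.5)  # true beta0 = 1, beta1 = 2
    m <- 2000                            # number of bootstrap replications
    fit <- lm(y ~ x)

    # Scheme 1: resample (x_i, y_i) pairs with replacement
    beta_pairs <- replicate(m, {
      idx <- sample(n, replace = TRUE)
      coef(lm(y[idx] ~ x[idx]))
    })

    # Scheme 2: keep x fixed, resample residuals with replacement
    beta_resid <- replicate(m, {
      y_star <- fitted(fit) + sample(residuals(fit), replace = TRUE)
      coef(lm(y_star ~ x))
    })

    # 95% percentile CIs; rows = (intercept, slope)
    t(apply(beta_pairs, 1, quantile, probs = c(0.025, 0.975)))
    t(apply(beta_resid, 1, quantile, probs = c(0.025, 0.975)))

The two schemes differ only in what is resampled; everything downstream (fitting, taking quantiles) is identical.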

My questions are:

  • How are these two methods different?
  • Under which assumptions do these two methods give the same result?

Best Answer

If the response-predictor pairs were obtained from a population by random sampling, it is safe to use the case/random-x resampling scheme (your first). If the predictors were controlled for, or their values were set by the experimenter, you may consider using the residual/model-based/fixed-x resampling scheme (your second).

How do the two differ? An introduction to the bootstrap with applications in R by Davison and Kuonen has a discussion pertinent to this question (see p. 9). See also the R code in this appendix by John Fox, particularly the functions boot.huber on p. 5 for the random-x scheme and boot.huber.fixed on p. 10 for the fixed-x scheme. While the lecture notes by Shalizi apply the two schemes to different datasets/problems, Fox's appendix illustrates how little difference the two schemes may often make.

When can the two be expected to deliver nearly identical results? One situation is when the regression model is correctly specified, e.g., there is no unmodelled nonlinearity and the usual regression assumptions (e.g., i.i.d. errors, no outliers) are satisfied. See chapter 21 of Fox's book (to which the aforementioned appendix with the R code belongs), particularly the discussion on page 598 and exercise 21.3, entitled "Random versus fixed resampling in regression". To quote from the book:

By randomly reattaching resampled residuals to fitted values, the [fixed-x/model-based] procedure implicitly assumes that the errors are identically distributed. If, for example, the true errors have non-constant variance, then this property will not be reflected in the resampled residuals. Likewise, the unique impact of a high-leverage outlier will be lost to the resampling.

You will also learn from that discussion why the fixed-x bootstrap implicitly assumes that the functional form of the model is correct (even though no assumption is made about the shape of the error distribution).
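To see the quoted point in action, here is a small R sketch (the heteroscedastic simulation design and all settings are my own illustration): when the error variance grows with $x$, the pairs bootstrap reflects the extra variability of the slope, while the residual bootstrap behaves as if the errors were identically distributed.

    set.seed(2)
    n <- 100
    x <- runif(n)
    y <- 1 + 2 * x + rnorm(n, sd = 0.2 + 2 * x)  # error SD grows with x
    fit <- lm(y ~ x)
    m <- 2000

    # Pairs bootstrap: each case carries its own error variance along
    slope_pairs <- replicate(m, {
      idx <- sample(n, replace = TRUE)
      coef(lm(y[idx] ~ x[idx]))[2]
    })

    # Residual bootstrap: scrambles residuals across x, losing the pattern
    slope_resid <- replicate(m, {
      y_star <- fitted(fit) + sample(residuals(fit), replace = TRUE)
      coef(lm(y_star ~ x))[2]
    })

    sd(slope_pairs)  # reflects the heteroscedasticity
    sd(slope_resid)  # acts as if errors were identically distributed

With homoscedastic errors the two standard errors would roughly agree, which is one concrete reading of "the same result" in your second question.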

See also slide 12 of this talk to the Society of Actuaries in Ireland by Derek Bain. It also has an illustration of what should count as "the same result":

The approach of re-sampling cases to generate pseudo data is the more usual form of bootstrapping. The approach is robust in that if an incorrect model is fitted an appropriate measure of parameter uncertainty is still obtained. However, resampling residuals is more efficient if the correct model has been fitted.

The graphs show both approaches in estimating the variance of a 26 point data sample mean and a 52 point sample mean. In the larger sample the two approaches are equivalent.