Solved – Bootstrapping confidence interval from a regression prediction

bootstrap, confidence interval, machine learning, regression, self-study

For homework, I was given data to create and train a predictor that uses lasso regression. I created the predictor and trained it using the Lasso implementation in the scikit-learn Python library.

So now I have this predictor that when given input can predict the output.

The second question was to "Extend your predictor to report the confidence interval of the prediction by using the bootstrapping method."

I've looked around and found examples of people doing this for the mean and other things.

But I am completely lost on how I'm supposed to do it for a prediction. I am trying to use the scikit-bootstrap library.

The course staff is being extremely unresponsive, so any help is appreciated. Thank you.

Best Answer

Bootstrapping means resampling your data with replacement. That is, instead of fitting your model once to the original X and y, you fit it many times, each time to a resampled version of X and y.

Thus, you get n slightly different models, and the spread of their predictions at a given input can be used to form a confidence interval. Here is a visual example of such an interval.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Create toy data
x = np.linspace(0, 10, 20)
y = x + (np.random.rand(len(x)) * 10)

# Add a column of 1s to x (redundant here, since LinearRegression fits an
# intercept by default, but harmless)
X = np.vstack([x, np.ones(len(x))]).T

plt.figure(figsize=(12,8))
for i in range(0, 500):
    # Draw row indices with replacement to form one bootstrap sample
    sample_index = np.random.choice(range(0, len(y)), len(y))

    X_samples = X[sample_index]
    y_samples = y[sample_index]

    # Fit to the resampled data and plot the fitted line in grey
    lr = LinearRegression()
    lr.fit(X_samples, y_samples)
    plt.plot(x, lr.predict(X), color='grey', alpha=0.2, zorder=1)

# Original data points
plt.scatter(x,y, marker='o', color='orange', zorder=4)

# Fit to the original data, drawn in red on top of the bootstrap fits
lr = LinearRegression()
lr.fit(X, y)
plt.plot(x, lr.predict(X), color='red', zorder=5)
plt.show()
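
To get a numeric interval for a lasso prediction rather than a plot, one common option is the percentile method: collect each bootstrap model's prediction at the new input and take the 2.5th and 97.5th percentiles. The following is only a sketch under assumptions. The data are synthetic, `alpha=0.1` is an arbitrary penalty, and the helper name `bootstrap_prediction_interval` is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso


def bootstrap_prediction_interval(X, y, x_new, n_boot=500, alpha=0.1,
                                  level=0.95, seed=0):
    """Percentile bootstrap interval for Lasso predictions at x_new.

    Hypothetical helper: alpha (Lasso penalty), n_boot and level are
    illustrative defaults, not values from the original question.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty((n_boot, len(x_new)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        preds[b] = model.predict(x_new)
    # Lower and upper percentiles of the bootstrap predictions
    tail = (1.0 - level) / 2.0 * 100.0
    lower, upper = np.percentile(preds, [tail, 100.0 - tail], axis=0)
    return lower, upper


# Synthetic data standing in for the homework data set (assumption)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1, size=100)

x_new = np.array([[5.0, 5.0, 5.0]])
point = Lasso(alpha=0.1).fit(X, y).predict(x_new)
lower, upper = bootstrap_prediction_interval(X, y, x_new)
print(point, lower, upper)
```

Note that an interval built this way reflects uncertainty in the fitted model (the predicted mean at `x_new`). It does not include the noise of a single new observation, so it will be narrower than a prediction interval for an individual outcome.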

Related Solutions

Solved – the meaning of a confidence interval taken from bootstrapped resamples

If the bootstrapping procedure and the formation of the confidence interval were performed correctly, it means the same as any other confidence interval. From a frequentist perspective, a 95% CI implies that if the entire study were repeated identically ad infinitum, 95% of such confidence intervals formed in this manner will include the true value. Of course, in your study, or in any given individual study, the confidence interval either will include the true value or not, but you won't know which. To understand these ideas further, it may help you to read my answer here: Why does a 95% Confidence Interval (CI) not imply a 95% chance of containing the mean?

Regarding your further questions, the 'true value' refers to the actual parameter of the relevant population. (Samples don't have parameters, they have statistics; e.g., the sample mean, $\bar x$, is a sample statistic, but the population mean, $\mu$, is a population parameter.) As to how we know this, in practice we don't. You are correct that we are relying on some assumptions; we always are. If those assumptions are correct, it can be proven that the properties hold. This was the point of Efron's work back in the late 1970s and early 1980s, but the math is difficult for most people to follow. For a somewhat mathematical explanation of the bootstrap, see @StasK's answer here: Explaining to laypeople why bootstrapping works. For a quick demonstration short of the math, consider the following simulation using R:

# a function to perform bootstrapping
boot.mean.sampling.distribution = function(raw.data, B=1000){
  # this function will take 1,000 (by default) bootsamples calculate the mean of 
  # each one, store it, & return the bootstrapped sampling distribution of the mean

  boot.dist = vector(length=B)     # this will store the means
  N         = length(raw.data)     # this is the N from your data
  for(i in 1:B){
    boot.sample  = sample(x=raw.data, size=N, replace=TRUE)
    boot.dist[i] = mean(boot.sample)
  }
  boot.dist = sort(boot.dist)
  return(boot.dist)
}

# simulate bootstrapped CIs from a population w/ true mean = 0; on each pass through
# the loop, we will get a sample of data from the population, get the bootstrapped
# sampling distribution of the mean, & see if the population mean is included in the
# 95% confidence interval implied by that sampling distribution

set.seed(00)                       # this makes the simulation reproducible
includes = vector(length=1000)     # this will store our results
for(i in 1:1000){
  sim.data    = rnorm(100, mean=0, sd=1)
  boot.dist   = boot.mean.sampling.distribution(raw.data=sim.data)
  includes[i] = boot.dist[25]<0 & 0<boot.dist[976]
}
mean(includes)     # this tells us the % of CIs that included the true mean
[1] 0.952

Solved – Using Bootstrap to estimate confidence interval of the standard deviation

I would say that since the sample mean is $\overline{X}=\frac{\sum{X_i}}{n}$, its convergence rate is $1/n$, which is also the convergence rate of the sample variance. But the convergence rate of the sample standard deviation is $1/\sqrt{n}$.
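
As a rough illustration of the topic in this thread's title, the sketch below forms a percentile bootstrap interval for a standard deviation. The data are synthetic (drawn with a true standard deviation of 2), and the function name `bootstrap_sd_ci` and its defaults are assumptions made for the example.

```python
import numpy as np


def bootstrap_sd_ci(data, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the standard deviation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Each row of idx is one bootstrap resample of the row indices
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_sds = data[idx].std(axis=1, ddof=1)
    tail = (1.0 - level) / 2.0 * 100.0
    lower, upper = np.percentile(boot_sds, [tail, 100.0 - tail])
    return lower, upper


# Synthetic sample (assumption): normal data with standard deviation 2
rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=2.0, size=200)

sd_lower, sd_upper = bootstrap_sd_ci(sample)
print(sample.std(ddof=1), sd_lower, sd_upper)
```

The percentile method is the simplest choice here. For small samples or skewed statistics, other bootstrap intervals (such as bias-corrected ones) may behave better.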
