Confidence Interval – How to Assess Coverage Probability in Simulations

confidence-interval, coverage-probability, simulation

I would like to know whether my simulation approach for finding the coverage of a confidence interval for a prediction $\boldsymbol{\beta}^T\boldsymbol{X}_N$ is correct.

  1. I generated a dataset of $n$ samples of covariates $\boldsymbol{X} \in \mathbb{R}^p$ and responses $Y \in \mathbb{R}$ that follow the linear model $Y_i = \boldsymbol{\beta}^T\boldsymbol{X}_i + \varepsilon_i$ for $i=1,\dots,n$, giving a design matrix $\mathbb{X} \in \mathbb{R}^{n \times p}$ and a response vector $\mathbb{Y} \in \mathbb{R}^{n}$. Here I set $n = 512$ and $p = 1024$ (the covariates were generated as multivariate standard normal).
  2. I created a new independent observation $\boldsymbol{X}_N \in \mathbb{R}^p$, $\boldsymbol{X}_N \sim N(0,I_p)$
  3. Computed the estimate $\widehat{\boldsymbol{\beta}} \in \mathbb{R}^p$ for the linear model.
  4. Found the true value of $\boldsymbol{\beta}^T\boldsymbol{X}_N$ (since I know the true parameter $\boldsymbol{\beta}$).
  5. Computed an estimator $\hat{V}$ of the variance $\text{var}(\widehat{\boldsymbol{\beta}}^T\boldsymbol{X}_N)$.
  6. Computed the confidence interval for $\boldsymbol{\beta}^T\boldsymbol{X}_N$ as $(\widehat{\boldsymbol{\beta}}^T\boldsymbol{X}_N \pm z_{\alpha/2}\hat{V}^{1/2})$, assuming asymptotic normality.
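For concreteness, one possible sketch of a single replicate of steps (1)-(6) in R is below. Note that with the stated $n = 512$ and $p = 1024$ the matrix $\mathbb{X}^T\mathbb{X}$ is singular and ordinary least squares is not defined, so this sketch assumes $p < n$; the values `p = 50` and the no-intercept model are illustrative choices, not part of the question.

```r
# One replicate of steps (1)-(6), assuming p < n so OLS exists.
set.seed(1)
n <- 512; p <- 50
beta <- rnorm(p)                          # true parameter
X <- matrix(rnorm(n * p), n, p)           # step 1: design matrix
y <- drop(X %*% beta + rnorm(n))          # step 1: responses, sigma = 1
xnew <- rnorm(p)                          # step 2: new observation X_N
bhat <- coef(lm(y ~ X - 1))               # step 3: OLS fit (no intercept)
target <- sum(beta * xnew)                # step 4: true value beta' X_N
XtXinv <- solve(crossprod(X))             # (X'X)^{-1}
s2 <- sum((y - X %*% bhat)^2) / (n - p)   # error variance estimate
Vhat <- s2 * drop(t(xnew) %*% XtXinv %*% xnew)  # step 5: var estimate
est <- sum(bhat * xnew)                   # point prediction bhat' X_N
ci <- est + c(-1, 1) * qnorm(0.975) * sqrt(Vhat)  # step 6: normal CI
covered <- ci[1] < target && target < ci[2]       # did the CI cover?
```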

Now I'm not sure how to proceed. Should I repeat the process from (3) or generate another dataset? Any help would be much appreciated.

Edit: I'm interested in the behavior of $\widehat{\boldsymbol{\beta}}^T\boldsymbol{X}_N$ since this is a univariate quantity. The new observation $\boldsymbol{X}_N$ is fixed once it is generated. So yes, I should say prediction interval, but the rest of the question remains.

Best Answer

As requested by Demetri in the comments to his (incorrect) answer, here is R code that correctly estimates the coverage of both prediction and confidence intervals by simulation. For simplicity, only a single covariate is considered.

coverage <- function(
  n, # sample size
  x = rnorm(n), # covariate 
  beta = rnorm(2), # true parameter values
  sigma = 1, # true error variance
  xnew = 1, # new x-value for which to predict
  nsim = 1e+4 # number of replicates to simulate
) {
  ci.hits <- 0
  pi.hits <- 0
  for (i in 1:nsim) {
    # simulate the observed data
    y <- beta[1] + beta[2]*x + rnorm(n, sd=sigma)
    # fit the model
    mod <- lm(y ~ x)
    # simulate a new observation to predict
    yhat <- beta[1] + beta[2]*xnew
    ynew <- yhat + rnorm(1, sd=sigma)
    # compute confidence and prediction intervals
    pi <- predict(mod, newdata=data.frame(x=xnew), interval="prediction")
    if (pi[1,"lwr"] < ynew & ynew < pi[1,"upr"])
      pi.hits <- pi.hits + 1
    ci <- predict(mod, newdata=data.frame(x=xnew), interval="confidence")
    if (ci[1,"lwr"] < yhat & yhat < ci[1,"upr"])
      ci.hits <- ci.hits + 1
  }
  list(pi.coverage = pi.hits/nsim, ci.coverage=ci.hits/nsim)
}

Varying, for example, the sample size $n$ (see the simulations below), the coverage never deviates significantly from the nominal level of 0.95. This is as expected, since it is well known that these intervals are exact (see any mathematical statistics textbook's treatment of linear regression).

> set.seed(1)
> coverage(n = 5)
$pi.coverage
[1] 0.952

$ci.coverage
[1] 0.9501

> coverage(n = 10)
$pi.coverage
[1] 0.9478

$ci.coverage
[1] 0.9507

> coverage(n = 20)
$pi.coverage
[1] 0.9532

$ci.coverage
[1] 0.9509
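To see why the coverage is exact rather than merely asymptotic, the intervals returned by `predict.lm` can be reproduced by hand from the $t$ distribution with $n-2$ residual degrees of freedom. The following sketch (with made-up illustrative data) checks the manual formulas against `predict`:

```r
# Reproduce predict()'s confidence and prediction intervals by hand.
set.seed(1)
n <- 10
x <- rnorm(n)
y <- 1 + 2*x + rnorm(n)          # illustrative data
mod <- lm(y ~ x)
xnew <- 1
X <- model.matrix(mod)
x0 <- c(1, xnew)                 # new design row (intercept, xnew)
s2 <- summary(mod)$sigma^2       # residual variance estimate
lev <- drop(t(x0) %*% solve(crossprod(X)) %*% x0)  # x0'(X'X)^{-1}x0
fit0 <- sum(coef(mod) * x0)      # point prediction
tc <- qt(0.975, df = n - 2)      # exact t critical value
ci.manual <- fit0 + c(-1, 1) * tc * sqrt(s2 * lev)        # conf. int.
pi.manual <- fit0 + c(-1, 1) * tc * sqrt(s2 * (1 + lev))  # pred. int.
ci <- predict(mod, data.frame(x = xnew), interval = "confidence")
pi <- predict(mod, data.frame(x = xnew), interval = "prediction")
all.equal(ci.manual, unname(ci[1, c("lwr", "upr")]))  # TRUE
all.equal(pi.manual, unname(pi[1, c("lwr", "upr")]))  # TRUE
```

The extra `1 +` inside the prediction-interval standard error accounts for the variance of the new error term, which is why the prediction interval is always wider than the confidence interval at the same $x$-value.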