Confidence Interval – How to Assess Coverage Probability in Simulations

confidence-interval, coverage-probability, simulation

I would like to know whether my simulation approach for finding the coverage of a confidence interval for a prediction $\boldsymbol{\beta}^T\boldsymbol{X}_N$ is correct.

  1. I generated a dataset of $n$ samples of covariates $\boldsymbol{X} \in \mathbb{R}^p$ and responses $Y \in \mathbb{R}$ that follow the linear model $Y_i = \boldsymbol{\beta}^T\boldsymbol{X}_i + \varepsilon_i$ for $i=1,\dots,n$, giving a design matrix $\mathbb{X} \in \mathbb{R}^{n \times p}$ and a response vector $\mathbb{Y} \in \mathbb{R}^{n}$. Here I set $n = 512$ and $p = 1024$ (the covariates were generated as multivariate standard normal).
  2. I created a new independent observation $\boldsymbol{X}_N \in \mathbb{R}^p$, $\boldsymbol{X}_N \sim N(0,I_p)$
  3. Computed the estimate $\widehat{\boldsymbol{\beta}} \in \mathbb{R}^p$ for the linear model.
  4. Found the true value of $\boldsymbol{\beta}^T\boldsymbol{X}_N$ (since I know the true parameter $\boldsymbol{\beta}$).
  5. Computed an estimator $\hat{V}$ of the variance $\text{var}(\widehat{\boldsymbol{\beta}}^T\boldsymbol{X}_N)$.
  6. Computed the confidence interval for $\boldsymbol{\beta}^T\boldsymbol{X}_N$ as $(\widehat{\boldsymbol{\beta}}^T\boldsymbol{X}_N \pm z_{\alpha/2}\hat{V}^{1/2})$, assuming asymptotic normality.
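For concreteness, one possible sketch of a single replicate of steps (1)-(6) in R is below. Note that with the stated $n = 512$ and $p = 1024$ the matrix $\mathbb{X}^T\mathbb{X}$ is singular and ordinary least squares is not defined, so this sketch assumes $p < n$; the values `p = 50` and the no-intercept model are illustrative choices, not part of the question.

```r
# One replicate of steps (1)-(6), assuming p < n so OLS exists.
set.seed(1)
n <- 512; p <- 50
beta <- rnorm(p)                          # true parameter
X <- matrix(rnorm(n * p), n, p)           # step 1: design matrix
y <- drop(X %*% beta + rnorm(n))          # step 1: responses, sigma = 1
xnew <- rnorm(p)                          # step 2: new observation X_N
bhat <- coef(lm(y ~ X - 1))               # step 3: OLS fit (no intercept)
target <- sum(beta * xnew)                # step 4: true value beta' X_N
XtXinv <- solve(crossprod(X))             # (X'X)^{-1}
s2 <- sum((y - X %*% bhat)^2) / (n - p)   # error variance estimate
Vhat <- s2 * drop(t(xnew) %*% XtXinv %*% xnew)  # step 5: var estimate
est <- sum(bhat * xnew)                   # point prediction bhat' X_N
ci <- est + c(-1, 1) * qnorm(0.975) * sqrt(Vhat)  # step 6: normal CI
covered <- ci[1] < target && target < ci[2]       # did the CI cover?
```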

Now I'm not sure how to proceed. Should I repeat the process from (3) or generate another dataset? Any help would be much appreciated.

Edit: I'm interested in the behavior of $\widehat{\boldsymbol{\beta}}^T\boldsymbol{X}_N$ since this is a univariate quantity. The new observation $\boldsymbol{X}_N$ is fixed once it is generated. So yes, I should say prediction interval, but the rest of the question remains.

Best Answer

As requested by Demetri in the comments to his (incorrect) answer, here is R code that correctly estimates the coverage of both prediction and confidence intervals by simulation. For simplicity, only a single covariate is considered.

coverage <- function(
  n, # sample size
  x = rnorm(n), # covariate 
  beta = rnorm(2), # true parameter values
  sigma = 1, # true error variance
  xnew = 1, # new x-value for which to predict
  nsim = 1e+4 # number of replicates to simulate
) {
  ci.hits <- 0
  pi.hits <- 0
  for (i in 1:nsim) {
    # simulate the observed data
    y <- beta[1] + beta[2]*x + rnorm(n, sd=sigma)
    # fit the model
    mod <- lm(y ~ x)
    # simulate a new observation to predict
    yhat <- beta[1] + beta[2]*xnew
    ynew <- yhat + rnorm(1, sd=sigma)
    # compute confidence and prediction intervals
    pi <- predict(mod, newdata=data.frame(x=xnew), interval="prediction")
    if (pi[1,"lwr"] < ynew & ynew < pi[1,"upr"])
      pi.hits <- pi.hits + 1
    ci <- predict(mod, newdata=data.frame(x=xnew), interval="confidence")
    if (ci[1,"lwr"] < yhat & yhat < ci[1,"upr"])
      ci.hits <- ci.hits + 1
  }
  list(pi.coverage = pi.hits/nsim, ci.coverage=ci.hits/nsim)
}

Varying, for example, the sample size $n$ (see the simulations below), the coverage never deviates significantly from the nominal level of 0.95. This is as expected, since it is well known that these intervals are exact (see any mathematical statistics textbook's treatment of linear regression).

> set.seed(1)
> coverage(n = 5)
$pi.coverage
[1] 0.952

$ci.coverage
[1] 0.9501

> coverage(n = 10)
$pi.coverage
[1] 0.9478

$ci.coverage
[1] 0.9507

> coverage(n = 20)
$pi.coverage
[1] 0.9532

$ci.coverage
[1] 0.9509
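To see why the coverage is exact rather than merely asymptotic, the intervals returned by `predict.lm` can be reproduced by hand from the $t$ distribution with $n-2$ residual degrees of freedom. The following sketch (with made-up illustrative data) checks the manual formulas against `predict`:

```r
# Reproduce predict()'s confidence and prediction intervals by hand.
set.seed(1)
n <- 10
x <- rnorm(n)
y <- 1 + 2*x + rnorm(n)          # illustrative data
mod <- lm(y ~ x)
xnew <- 1
X <- model.matrix(mod)
x0 <- c(1, xnew)                 # new design row (intercept, xnew)
s2 <- summary(mod)$sigma^2       # residual variance estimate
lev <- drop(t(x0) %*% solve(crossprod(X)) %*% x0)  # x0'(X'X)^{-1}x0
fit0 <- sum(coef(mod) * x0)      # point prediction
tc <- qt(0.975, df = n - 2)      # exact t critical value
ci.manual <- fit0 + c(-1, 1) * tc * sqrt(s2 * lev)        # conf. int.
pi.manual <- fit0 + c(-1, 1) * tc * sqrt(s2 * (1 + lev))  # pred. int.
ci <- predict(mod, data.frame(x = xnew), interval = "confidence")
pi <- predict(mod, data.frame(x = xnew), interval = "prediction")
all.equal(ci.manual, unname(ci[1, c("lwr", "upr")]))  # TRUE
all.equal(pi.manual, unname(pi[1, c("lwr", "upr")]))  # TRUE
```

The extra `1 +` inside the prediction-interval standard error accounts for the variance of the new error term, which is why the prediction interval is always wider than the confidence interval at the same $x$-value.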